* BUG: scheduling while atomic in blk_mq codepath? @ 2014-06-19 15:35 Theodore Ts'o 2014-06-19 15:59 ` Jens Axboe 0 siblings, 1 reply; 21+ messages in thread From: Theodore Ts'o @ 2014-06-19 15:35 UTC (permalink / raw) To: Jens Axboe; +Cc: linux-kernel, linux-scsi While trying to bisect some problems which were introduced sometime between 3.15 and 3.16-rc1 (specifically, (1) reads to a block device at offset 262144 * 4k are failing with a short read, and (2) block device reads are sometimes causing the entire kernel to hang), the following BUG got hit. [ 0.000000] Linux version 3.15.0-rc8-06047-gaaeb255 (tytso@closure) (gcc version 4.8.3 (Debian 4.8.3-2) ) #1902 SMP Thu Jun 19 11:16:10 EDT 2014 [....] Checking file systems...fsck from util-linux 2.20.1 /dev/vdg was not cleanly unmounted, check forced. [ 4.161703] BUG: scheduling while atomic: fsck.ext4/2072/0x0000000266.5% [ 4.163673] no locks held by fsck.ext4/2072. [ 4.164318] Modules linked in: [ 4.164845] CPU: 0 PID: 2072 Comm: fsck.ext4 Not tainted 3.15.0-rc8-06047-gaaeb255 #1902 [ 4.166047] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [ 4.166917] 00000000 00000000 f52c5ba0 c0832655 f5158610 f52c5bac c082f88a f6501e40 [ 4.168188] f52c5c20 c08362ca c0eb3e40 c0eb3e40 374d3933 00000001 0396a8da 00000000 [ 4.169474] f5158610 f51f1674 f4f46a00 f52c5be4 c015dd4b f4f46a00 f52c5bf0 c015dd5e [ 4.170781] Call Trace: [ 4.171159] [<c0832655>] dump_stack+0x48/0x60 [ 4.171838] [<c082f88a>] __schedule_bug+0x5c/0x6d [ 4.172572] [<c08362ca>] __schedule+0x61/0x65a [ 4.173228] [<c015dd4b>] ? kvm_clock_read+0x1f/0x29 [ 4.173977] [<c015dd5e>] ? kvm_clock_get_cycles+0x9/0xc [ 4.174771] [<c01b4cb9>] ? timekeeping_get_ns.constprop.14+0x10/0x56 [ 4.175701] [<c0836922>] schedule+0x5f/0x61 [ 4.176345] [<c0836aa2>] io_schedule+0x50/0x67 [ 4.177060] [<c0423b2d>] bt_get+0xaf/0xd1 [ 4.177677] [<c0198282>] ? wake_up_atomic_t+0x1f/0x1f [ 4.178444] [<c0423bfd>] blk_mq_get_tag+0x26/0x82 [ 4.179158] [<c0420f14>] __blk_mq_alloc_request+0x2a/0x169 [ 4.180022] [<c04222b5>] blk_mq_map_request+0x137/0x1e3 [ 4.180825] [<c0422f89>] blk_sq_make_request+0x82/0x145 [ 4.181630] [<c041a687>] generic_make_request+0x82/0xb5 [ 4.182430] [<c041a7aa>] submit_bio+0xf0/0x109 [ 4.183113] [<c019e97c>] ? trace_hardirqs_on_caller+0x14e/0x169 [ 4.184019] [<c025de72>] _submit_bh+0x1ad/0x1ca [ 4.184661] [<c025de9e>] submit_bh+0xf/0x11 [ 4.185267] [<c025f5c9>] block_read_full_page+0x1e2/0x1f2 [ 4.186073] [<c025f8cd>] ? I_BDEV+0xa/0xa [ 4.186695] [<c020ad30>] ? __lru_cache_add+0x24/0x46 [ 4.187452] [<c020af13>] ? lru_cache_add+0xd/0xf [ 4.188130] [<c025fc04>] blkdev_readpage+0x14/0x16 [ 4.188832] [<c0209adf>] __do_page_cache_readahead+0x1c0/0x1eb [ 4.189704] [<c0209cb9>] ondemand_readahead+0x1af/0x1b9 [ 4.190508] [<c0209d22>] page_cache_async_readahead+0x5f/0x6a [ 4.191424] [<c0202370>] generic_file_aio_read+0x226/0x4f4 [ 4.192272] [<c0260841>] blkdev_aio_read+0x90/0x9e [ 4.193017] [<c02385cd>] do_sync_read+0x52/0x79 [ 4.193731] [<c023857b>] ? fdput_pos+0x25/0x25 [ 4.194412] [<c0238d27>] vfs_read+0x72/0xd1 [ 4.195064] [<c02391da>] SyS_read+0x49/0x7c [ 4.195700] [<c083a0c9>] syscall_call+0x7/0xb [ 4.196385] [<c0830000>] ? print_usage_bug+0xcd/0x18e Is any of these known problems? This is blocking me from doing any kind of testing at the moment... (these problems are showing up while running KVM using virtio devices). - Ted ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: BUG: scheduling while atomic in blk_mq codepath? 2014-06-19 15:35 BUG: scheduling while atomic in blk_mq codepath? Theodore Ts'o @ 2014-06-19 15:59 ` Jens Axboe 2014-06-19 16:08 ` Theodore Ts'o 0 siblings, 1 reply; 21+ messages in thread From: Jens Axboe @ 2014-06-19 15:59 UTC (permalink / raw) To: Theodore Ts'o, linux-kernel, linux-scsi On 2014-06-19 08:35, Theodore Ts'o wrote: > While trying to bisect some problems which were introduced sometime > between 3.15 and 3.16-rc1 (specifically, (1) reads to a block device > at offset 262144 * 4k are failing with a short read, and (2) block > device reads are sometimes causing the entire kernel to hang), the > following BUG got hit. > > [ 0.000000] Linux version 3.15.0-rc8-06047-gaaeb255 (tytso@closure) (gcc version 4.8.3 (Debian 4.8.3-2) ) #1902 SMP Thu Jun 19 11:16:10 EDT 2014 > > [....] Checking file systems...fsck from util-linux 2.20.1 > /dev/vdg was not cleanly unmounted, check forced. > [ 4.161703] BUG: scheduling while atomic: fsck.ext4/2072/0x0000000266.5% > [ 4.163673] no locks held by fsck.ext4/2072. > [ 4.164318] Modules linked in: > [ 4.164845] CPU: 0 PID: 2072 Comm: fsck.ext4 Not tainted 3.15.0-rc8-06047-gaaeb255 #1902 > [ 4.166047] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 > [ 4.166917] 00000000 00000000 f52c5ba0 c0832655 f5158610 f52c5bac c082f88a f6501e40 > [ 4.168188] f52c5c20 c08362ca c0eb3e40 c0eb3e40 374d3933 00000001 0396a8da 00000000 > [ 4.169474] f5158610 f51f1674 f4f46a00 f52c5be4 c015dd4b f4f46a00 f52c5bf0 c015dd5e > [ 4.170781] Call Trace: > [ 4.171159] [<c0832655>] dump_stack+0x48/0x60 > [ 4.171838] [<c082f88a>] __schedule_bug+0x5c/0x6d > [ 4.172572] [<c08362ca>] __schedule+0x61/0x65a > [ 4.173228] [<c015dd4b>] ? kvm_clock_read+0x1f/0x29 > [ 4.173977] [<c015dd5e>] ? kvm_clock_get_cycles+0x9/0xc > [ 4.174771] [<c01b4cb9>] ? timekeeping_get_ns.constprop.14+0x10/0x56 > [ 4.175701] [<c0836922>] schedule+0x5f/0x61 > [ 4.176345] [<c0836aa2>] io_schedule+0x50/0x67 > [ 4.177060] [<c0423b2d>] bt_get+0xaf/0xd1 > [ 4.177677] [<c0198282>] ? wake_up_atomic_t+0x1f/0x1f > [ 4.178444] [<c0423bfd>] blk_mq_get_tag+0x26/0x82 > [ 4.179158] [<c0420f14>] __blk_mq_alloc_request+0x2a/0x169 > [ 4.180022] [<c04222b5>] blk_mq_map_request+0x137/0x1e3 > [ 4.180825] [<c0422f89>] blk_sq_make_request+0x82/0x145 > [ 4.181630] [<c041a687>] generic_make_request+0x82/0xb5 > [ 4.182430] [<c041a7aa>] submit_bio+0xf0/0x109 > [ 4.183113] [<c019e97c>] ? trace_hardirqs_on_caller+0x14e/0x169 > [ 4.184019] [<c025de72>] _submit_bh+0x1ad/0x1ca > [ 4.184661] [<c025de9e>] submit_bh+0xf/0x11 > [ 4.185267] [<c025f5c9>] block_read_full_page+0x1e2/0x1f2 > [ 4.186073] [<c025f8cd>] ? I_BDEV+0xa/0xa > [ 4.186695] [<c020ad30>] ? __lru_cache_add+0x24/0x46 > [ 4.187452] [<c020af13>] ? lru_cache_add+0xd/0xf > [ 4.188130] [<c025fc04>] blkdev_readpage+0x14/0x16 > [ 4.188832] [<c0209adf>] __do_page_cache_readahead+0x1c0/0x1eb > [ 4.189704] [<c0209cb9>] ondemand_readahead+0x1af/0x1b9 > [ 4.190508] [<c0209d22>] page_cache_async_readahead+0x5f/0x6a > [ 4.191424] [<c0202370>] generic_file_aio_read+0x226/0x4f4 > [ 4.192272] [<c0260841>] blkdev_aio_read+0x90/0x9e > [ 4.193017] [<c02385cd>] do_sync_read+0x52/0x79 > [ 4.193731] [<c023857b>] ? fdput_pos+0x25/0x25 > [ 4.194412] [<c0238d27>] vfs_read+0x72/0xd1 > [ 4.195064] [<c02391da>] SyS_read+0x49/0x7c > [ 4.195700] [<c083a0c9>] syscall_call+0x7/0xb > [ 4.196385] [<c0830000>] ? print_usage_bug+0xcd/0x18e > > Is any of these known problems? This is blocking me from doing any > kind of testing at the moment... (these problems are showing up while > running KVM using virtio devices). I believe you already reported this issue a while back, and it should be fixed by commit cb96a42c in the kernel. The other issue, not sure, not a lot of detail. It may be fixed by the pull request I sent out yesterday. You can try pulling in: git://git.kernel.dk/linux-block.git for-linus and see if that fixes it for you. -- Jens Axboe ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: BUG: scheduling while atomic in blk_mq codepath? 2014-06-19 15:59 ` Jens Axboe @ 2014-06-19 16:08 ` Theodore Ts'o 2014-06-19 16:21 ` Theodore Ts'o 0 siblings, 1 reply; 21+ messages in thread From: Theodore Ts'o @ 2014-06-19 16:08 UTC (permalink / raw) To: Jens Axboe; +Cc: linux-kernel, linux-scsi On Thu, Jun 19, 2014 at 08:59:26AM -0700, Jens Axboe wrote: > I believe you already reported this issue a while back, and it should be > fixed by commit cb96a42c in the kernel. Ah yes, I had forgotten. Thanks for the reminder! > The other issue, not sure, not a lot of detail. It may be fixed by the pull > request I sent out yesterday. You can try pulling in: > > git://git.kernel.dk/linux-block.git for-linus Thanks, I'll give that a try. - Ted ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: BUG: scheduling while atomic in blk_mq codepath? 2014-06-19 16:08 ` Theodore Ts'o @ 2014-06-19 16:21 ` Theodore Ts'o 2014-06-19 22:38 ` Dave Chinner 0 siblings, 1 reply; 21+ messages in thread From: Theodore Ts'o @ 2014-06-19 16:21 UTC (permalink / raw) To: Jens Axboe, linux-kernel, linux-scsi On Thu, Jun 19, 2014 at 12:08:01PM -0400, Theodore Ts'o wrote: > > The other issue, not sure, not a lot of detail. It may be fixed by the pull > > request I sent out yesterday. You can try pulling in: > > > > git://git.kernel.dk/linux-block.git for-linus > > Thanks, I'll give that a try. I tried merging in your for-linus branch in v3.16-rc1, and I'm seeing the following. On a 32-bit x86 3.15 kernel, run: "mke2fs -t ext3 /dev/vdc" where /dev/vdc is a 5 gig virtio partition. The boot 3.16-rc1 with linux-block.git's for_linus branch merged in. (Or 3.16-rc1 by itself): root@candygram:~# e2fsck -fy /dev/vdc e2fsck 1.43-WIP (4-Feb-2014) Pass 1: Checking inodes, blocks, and sizes [ 22.872217] random: nonblocking pool is initialized Pass 2: Checking directory structure Pass 3: Checking directory connectivity Pass 4: Checking reference counts Pass 5: Checking group summary information Error reading block 262144 (Attempt to read block from filesystem resulted in short read) while reading inode and block bitmaps. Ignore error? yes Force rewrite? yes Block bitmap differences: +(262144--262657) Fix? yes /dev/vdc: ***** FILE SYSTEM WAS MODIFIED ***** /dev/vdc: 11/327680 files (0.0% non-contiguous), 55935/1310720 blocks With the 3.15 kernel, this works fine. - Ted ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: BUG: scheduling while atomic in blk_mq codepath? 2014-06-19 16:21 ` Theodore Ts'o @ 2014-06-19 22:38 ` Dave Chinner 2014-06-21 3:51 ` 32-bit bug in iovec iterator changes Theodore Ts'o 0 siblings, 1 reply; 21+ messages in thread From: Dave Chinner @ 2014-06-19 22:38 UTC (permalink / raw) To: Theodore Ts'o, Jens Axboe, linux-kernel, linux-scsi On Thu, Jun 19, 2014 at 12:21:44PM -0400, Theodore Ts'o wrote: > On Thu, Jun 19, 2014 at 12:08:01PM -0400, Theodore Ts'o wrote: > > > The other issue, not sure, not a lot of detail. It may be fixed by the pull > > > request I sent out yesterday. You can try pulling in: > > > > > > git://git.kernel.dk/linux-block.git for-linus > > > > Thanks, I'll give that a try. > > I tried merging in your for-linus branch in v3.16-rc1, and I'm seeing > the following. On a 32-bit x86 3.15 kernel, run: "mke2fs -t ext3 > /dev/vdc" where /dev/vdc is a 5 gig virtio partition. Short reads are more likely a bug in all the iovec iterator stuff that got merged in from the vfs tree. ISTR a 32 bit-only bug in that stuff go past in to do with not being able to partition a 32GB block dev on a 32 bit system due to a 32 bit size_t overflow somewhere.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 21+ messages in thread
* 32-bit bug in iovec iterator changes 2014-06-19 22:38 ` Dave Chinner @ 2014-06-21 3:51 ` Theodore Ts'o 2014-06-21 5:53 ` Al Viro 0 siblings, 1 reply; 21+ messages in thread From: Theodore Ts'o @ 2014-06-21 3:51 UTC (permalink / raw) To: Al Viro; +Cc: Dave Chinner, Jens Axboe, linux-kernel, linux-scsi On Fri, Jun 20, 2014 at 08:38:20AM +1000, Dave Chinner wrote: > > Short reads are more likely a bug in all the iovec iterator stuff > that got merged in from the vfs tree. ISTR a 32 bit-only bug in that > stuff go past in to do with not being able to partition a 32GB block > dev on a 32 bit system due to a 32 bit size_t overflow somewhere.... Dave Chinner called it. Al, I'm seeing a regression which shows up using a 32-bit x86 kernel. The symptoms of the bug is when run under KVM, with a 5 GB /dev/vdc virtual block device, a read at offset 2 ** 30 fails with a short read: # dd if=/dev/vdc of=/dev/null bs=4k skip=262144 count=1 0+0 records in 0+0 records out 0 bytes (0 B) copied, 0.0164144 s, 0.0 kB/s On a 3.15 kernel, this command works: # dd if=/dev/vdc of=/dev/null bs=4k skip=262144 count=1 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.0457984 s, 89.4 kB/s I tried bisecting it, but unfortunately the iovec iterator changes are not cleanly bisectable, since copy_page_from_iter() gets introduced some two dozen patches before it gets defined. :-( However, the bisect leads quite squarely to to the iovec iterator patches. Al, I'd appreciate it if you could take a look? Thanks!! - Ted % git bisect start # good: [1860e379875dfe7271c649058aeddffe5afd9d0d] Linux 3.15 git bisect good 1860e379875dfe7271c649058aeddffe5afd9d0d # bad: [7171511eaec5bf23fb06078f59784a3a0626b38f] Linux 3.16-rc1 git bisect bad 7171511eaec5bf23fb06078f59784a3a0626b38f # good: [aaeb2554337217dfa4eac2fcc90da7be540b9a73] Merge branch 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media into next git bisect good aaeb2554337217dfa4eac2fcc90da7be540b9a73 # bad: [16b9057804c02e2d351e9c8f606e909b43cbd9e7] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs git bisect bad 16b9057804c02e2d351e9c8f606e909b43cbd9e7 # good: [82abb273d838318424644d8f02825db0fbbd400a] Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus git bisect good 82abb273d838318424644d8f02825db0fbbd400a # good: [d1e1cda862c16252087374ac75949b0e89a5717e] Merge tag 'nfs-for-3.16-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs git bisect good d1e1cda862c16252087374ac75949b0e89a5717e # good: [23d4ed53b7342bf5999b3ea227d9f69e75e5a625] Merge branch 'for-linus' of git://git.kernel.dk/linux-block git bisect good 23d4ed53b7342bf5999b3ea227d9f69e75e5a625 # good: [2840c566e95599cd60c7143762ca8b49d9395050] Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs git bisect good 2840c566e95599cd60c7143762ca8b49d9395050 # good: [4251c2a67011801caecd63671f26dd8c9aedb24c] Merge tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux git bisect good 4251c2a67011801caecd63671f26dd8c9aedb24c # skip: [3dae8750c368f8ac11c3c8c2a28f56dcee865c01] cifs: switch to ->write_iter() git bisect skip 3dae8750c368f8ac11c3c8c2a28f56dcee865c01 # good: [5c02c392cd2320e8d612376d6b72b6548a680923] Merge tag 'virtio-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux git bisect good 5c02c392cd2320e8d612376d6b72b6548a680923 # bad: [5f073850602084fbcbb987948ff3e70ae273f7d2] kill generic_file_splice_write() git bisect bad 5f073850602084fbcbb987948ff3e70ae273f7d2 # good: [38583f095c5a8138ae2a1c9173d0fd8a9f10e8aa] Merge branch 'akpm' (incoming from Andrew) git bisect good 38583f095c5a8138ae2a1c9173d0fd8a9f10e8aa # good: [f5ccfe1ddbaf9d923a3ebdadcb1e5e32d83e9c28] ext4: fix locking for O_APPEND writes git bisect good f5ccfe1ddbaf9d923a3ebdadcb1e5e32d83e9c28 # bad: [f0d1bec9d58d4c038d0ac958c9af82be6eb18045] new helper: copy_page_from_iter() git bisect bad f0d1bec9d58d4c038d0ac958c9af82be6eb18045 % (unset DISPLAY; git bisect visualize) commit f0d1bec9d58d4c038d0ac958c9af82be6eb18045 Author: Al Viro <viro@zeniv.linux.org.uk> Date: Thu Apr 3 15:05:18 2014 -0400 new helper: copy_page_from_iter() parallel to copy_page_to_iter(). pipe_write() switched to it (and became ->write_iter()). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> commit 84c3d55cc474f9c234c023c92e2769f940d5548c Author: Al Viro <viro@zeniv.linux.org.uk> Date: Thu Apr 3 14:33:23 2014 -0400 fuse: switch to ->write_iter() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> commit b30ac0fc4109701fc122d41ee085c65b52dc44a3 Author: Al Viro <viro@zeniv.linux.org.uk> Date: Thu Apr 3 14:29:04 2014 -0400 btrfs: switch to ->write_iter() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> commit 3ef045c3d8ae8550abbfd44074efce6ff642cc86 Author: Al Viro <viro@zeniv.linux.org.uk> Date: Thu Apr 3 14:25:22 2014 -0400 ocfs2: switch to ->write_iter() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> commit bf97f3bc0c32140c43fe5ca53d23514ea46a54ca Author: Al Viro <viro@zeniv.linux.org.uk> Date: Thu Apr 3 14:20:23 2014 -0400 xfs: switch to ->write_iter() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> commit 50b5551d1719c8bce60c6d4027b814cfc72c2307 Author: Al Viro <viro@zeniv.linux.org.uk> Date: Thu Apr 3 14:13:46 2014 -0400 afs: switch to ->write_iter() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> commit da56e45b6ee83b67a586c61819cd2b5cfd806eb8 Author: Al Viro <viro@zeniv.linux.org.uk> Date: Thu Apr 3 14:11:01 2014 -0400 gfs2: switch to ->write_iter() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> commit edaf43694898c5d7deb9a394335c60e888039100 Author: Al Viro <viro@zeniv.linux.org.uk> Date: Thu Apr 3 14:07:25 2014 -0400 nfs: switch to ->write_iter() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> commit f5674c31ee1f968606702e82c160d6ae11032ded Author: Al Viro <viro@zeniv.linux.org.uk> Date: Thu Apr 3 14:00:23 2014 -0400 ubifs: switch to ->write_iter() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> commit a8f3550cd228b6edc5d17fce1a9af8cc7004f185 Author: Al Viro <viro@zeniv.linux.org.uk> Date: Thu Apr 3 03:32:25 2014 -0400 bury __generic_file_aio_write() all users converted to __generic_file_write_iter() now Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> commit 3dae8750c368f8ac11c3c8c2a28f56dcee865c01 Author: Al Viro <viro@zeniv.linux.org.uk> Date: Thu Apr 3 12:05:17 2014 -0400 cifs: switch to ->write_iter() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> commit d4637bc18f3bf89bd63fe7b967043523aa3a6170 Author: Al Viro <viro@zeniv.linux.org.uk> Date: Thu Apr 3 03:31:17 2014 -0400 udf: switch to ->write_iter() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> commit 9b884164d59707216840159d45f6be68073fac6e Author: Al Viro <viro@zeniv.linux.org.uk> Date: Thu Apr 17 16:09:22 2014 -0400 convert ext4 to ->write_iter() unfortunately, Ted's changes to ext4_file_write() are *still* an incomplete fix - playing with rlimits can let you smuggle an unaligned request past the checks. So there almost certainly will be more merge PITA around that place... [fix from Peter Ujfalusi <peter.ujfalusi@ti.com> folded] Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> commit a8324754889c0a491b216bc0502ef9ba557eeac7 Merge: 1456c0a f5ccfe1 Author: Al Viro <viro@zeniv.linux.org.uk> Date: Tue May 6 17:38:41 2014 -0400 Merge ext4 changes in ext4_file_write() into for-next From ext4.git#dev, needed for switch of ext4 to ->write_iter() ;-/ commit 1456c0a87c4241d3a801651019e66983c69ad17d Author: Al Viro <viro@zeniv.linux.org.uk> Date: Thu Apr 3 03:21:50 2014 -0400 blkdev_aio_write() - turn into blkdev_write_iter() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> commit 8174202b34c30e0c07231bf63f18ab29af634f0b Author: Al Viro <viro@zeniv.linux.org.uk> Date: Thu Apr 3 03:17:43 2014 -0400 write_iter variants of {__,}generic_file_aio_write() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> commit 3644424dc6309439c4c8d97590cdac4100376255 Author: Al Viro <viro@zeniv.linux.org.uk> Date: Wed Apr 2 20:28:01 2014 -0400 ceph: switch to ->read_iter() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> commit 3aa2d199f8eb8149a88005e88736d583cbc39d31 Author: Al Viro <viro@zeniv.linux.org.uk> Date: Wed Apr 2 20:14:12 2014 -0400 nfs: switch to ->read_iter() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> commit a886038baa48a17f2cf77fb5dcc204fd28659d8f Author: Al Viro <viro@zeniv.linux.org.uk> Date: Wed Apr 2 20:02:21 2014 -0400 fs/block_dev.c: switch to ->read_iter() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> commit 2ba5bbed0cd7429dbd567fa885ae3bc7a76de3d4 Author: Al Viro <viro@zeniv.linux.org.uk> Date: Wed Apr 2 20:00:02 2014 -0400 shmem: switch to ->read_iter() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> commit fb9096a344e2964c6a71520931c08abb1301248e Author: Al Viro <viro@zeniv.linux.org.uk> Date: Wed Apr 2 19:56:54 2014 -0400 pipe: switch to ->read_iter() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> commit e6a7bcb4c489e3e078ba3cc92ae6621b2b8bb9a7 Author: Al Viro <viro@zeniv.linux.org.uk> Date: Wed Apr 2 19:53:36 2014 -0400 cifs: switch to ->read_iter() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> commit 37c20f16e7a73e5fe34815e785ca6c5a46e4d260 Author: Al Viro <viro@zeniv.linux.org.uk> Date: Wed Apr 2 14:47:09 2014 -0400 fuse_file_aio_read(): convert to ->read_iter() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> commit 3cd9ad5a303a0d503492002c4af95becfa99af03 Author: Al Viro <viro@zeniv.linux.org.uk> Date: Wed Apr 2 14:44:18 2014 -0400 ocfs2: switch to ->read_iter() tracepoints are evil, exhibit #6969... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> commit 027978295d455b2fdff0cb36570f83ada7385ea6 Author: Al Viro <viro@zeniv.linux.org.uk> Date: Wed Apr 2 14:40:38 2014 -0400 ecryptfs: switch to ->read_iter() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> commit b4f5d2c6d1f88c79e48f1296076b3a6a22f58c0f Author: Al Viro <viro@zeniv.linux.org.uk> Date: Wed Apr 2 14:37:59 2014 -0400 xfs: switch to ->read_iter() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> commit aad4f8bb42af06371aa0e85bf0cd9d52c0494985 Author: Al Viro <viro@zeniv.linux.org.uk> Date: Wed Apr 2 14:33:16 2014 -0400 switch simple generic_file_aio_read() users to ->read_iter() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> commit 293bc9822fa9b3c9d4b7893bcb241e085580771a Author: Al Viro <viro@zeniv.linux.org.uk> Date: Tue Feb 11 18:37:41 2014 -0500 new methods: ->read_iter() and ->write_iter() Beginning to introduce those. Just the callers for now, and it's clumsier than it'll eventually become; once we finish converting aio_read and aio_write instances, the things will get nicer. For now, these guys are in parallel to ->aio_read() and ->aio_write(); they take iocb and iov_iter, with everything in iov_iter already validated. File offset is passed in iocb->ki_pos, iov/nr_segs - in iov_iter. Main concerns in that series are stack footprint and ability to split the damn thing cleanly. [fix from Peter Ujfalusi <peter.ujfalusi@ti.com> folded] Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> commit 7f7f25e82d54870df24d415a7007fbd327da027b Author: Al Viro <viro@zeniv.linux.org.uk> Date: Tue Feb 11 17:49:24 2014 -0500 replace checking for ->read/->aio_read presence with check in ->f_mode Since we are about to introduce new methods (read_iter/write_iter), the tests in a bunch of places would have to grow inconveniently. Check once (at open() time) and store results in ->f_mode as FMODE_CAN_READ and FMODE_CAN_WRITE resp. It might end up being a temporary measure - once everything switches from ->aio_{read,write} to ->{read,write}_iter it might make sense to return to open-coded checks. We'll see... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> commit b318891929c2750055a4002bee3e7636ca3684de Author: Al Viro <viro@zeniv.linux.org.uk> Date: Wed Apr 2 07:06:30 2014 -0400 xfs: trim the argument lists of xfs_file_{dio,buffered}_aio_write() pos is redundant (it's iocb->ki_pos), and iov/nr_segs/count are taken care of by lifting iov_iter into the caller. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> commit 37938463540b075e9166cf774c59274379f7a8ca Author: Al Viro <viro@zeniv.linux.org.uk> Date: Sat Mar 22 06:57:37 2014 -0400 blkdev_aio_read(): switch to generic_file_read_iter(), get rid of iov_shorten() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> commit 0c949334a9e2581646c6ff0d1470a805b1e5be99 Author: Al Viro <viro@zeniv.linux.org.uk> Date: Sat Mar 22 06:51:37 2014 -0400 iov_iter_truncate() Now It Can Be Done(tm) - we don't need to do iov_shorten() in generic_file_direct_write() anymore, now that all ->direct_IO() instances are converted to proper iov_iter methods and honour iter->count and iter->iov_offset properly. Get rid of count/ocount arguments of generic_file_direct_write(), while we are at it. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> commit 28060d5d9b261da110afe48aae7a2aa6555f798f Author: Al Viro <viro@zeniv.linux.org.uk> Date: Sat Mar 22 05:15:17 2014 -0400 btrfs: switch check_direct_IO() to iov_iter ... and don't open-code iov_iter_alignment() there Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> commit 91f79c43d1b54d7154b118860d81b39bad07dfff Author: Al Viro <viro@zeniv.linux.org.uk> Date: Fri Mar 21 04:58:33 2014 -0400 new helper: iov_iter_get_pages_alloc() same as iov_iter_get_pages(), except that pages array is allocated (kmalloc if possible, vmalloc if that fails) and left for caller to free. Lustre and NFS ->direct_IO() switched to it. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> commit f67da30c1d5fc9e341bc8121708874bfd7b31e45 Author: Al Viro <viro@zeniv.linux.org.uk> Date: Wed Mar 19 01:16:16 2014 -0400 new helper: iov_iter_npages() counts the pages covered by iov_iter, up to given limit. do_block_direct_io() and fuse_iter_npages() switched to it. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> commit 5b46f25ddc6edf4adff1a137d453da542c27a640 Author: Al Viro <viro@zeniv.linux.org.uk> Date: Sun Mar 16 18:07:34 2014 -0400 f2fs: switch to iov_iter_alignment() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> commit c9c37e2e63786c595d704244cbb7d19dc5630493 Author: Al Viro <viro@zeniv.linux.org.uk> Date: Sun Mar 16 16:08:30 2014 -0400 fuse: switch to iov_iter_get_pages() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> commit d22a943f44c79c98ac7a93653fdd330378581741 Author: Al Viro <viro@zeniv.linux.org.uk> Date: Sun Mar 16 15:50:47 2014 -0400 fuse: pull iov_iter initializations up ... to fuse_direct_{read,write}(). ->direct_IO() path uses the iov_iter passed by the caller instead. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> commit 7b2c99d15559e285384c742db52316802e24b0bd Author: Al Viro <viro@zeniv.linux.org.uk> Date: Sat Mar 15 04:05:57 2014 -0400 new helper: iov_iter_get_pages() iov_iter_get_pages(iter, pages, maxsize, &start) grabs references pinning the pages of up to maxsize of (contiguous) data from iter. Returns the amount of memory grabbed or -error. In case of success, the requested area begins at offset start in pages[0] and runs through pages[1], etc. Less than requested amount might be returned - either because the contiguous area in the beginning of iterator is smaller than requested, or because the kernel failed to pin that many pages. direct-io.c switched to using iov_iter_get_pages() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> commit 3320c60b3a26d05666285c55ab08ee043c017ba3 Author: Al Viro <viro@zeniv.linux.org.uk> Date: Mon Mar 10 02:30:55 2014 -0400 dio: take updating ->result into do_direct_IO() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> commit 71d8e532b1549a478e6a6a8a44f309d050294d00 Author: Al Viro <viro@zeniv.linux.org.uk> Date: Wed Mar 5 19:28:09 2014 -0500 start adding the tag to iov_iter For now, just use the same thing we pass to ->direct_IO() - it's all iovec-based at the moment. Pass it explicitly to iov_iter_init() and account for kvec vs. iovec in there, by the same kludge NFS ->direct_IO() uses. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> commit ed978a811ec528dbe40243605c3afab55892f722 Author: Al Viro <viro@zeniv.linux.org.uk> Date: Wed Mar 5 22:53:04 2014 -0500 new helper: generic_file_read_iter() iov_iter-using variant of generic_file_aio_read(). Some callers converted. Note that it's still not quite there for use as ->read_iter() - we depend on having zero iter->iov_offset in O_DIRECT case. Fortunately, that's true for all converted callers (and for generic_file_aio_read() itself). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> commit 23faa7b8db9be0be4f158cfc558460bb95d9b245 Author: Al Viro <viro@zeniv.linux.org.uk> Date: Wed Mar 5 22:52:34 2014 -0500 fuse_file_aio_write(): merge initializations of iov_iter Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> commit 05bb2e0bc77cb005248be318d2b0ba369b8bbab3 Author: Al Viro <viro@zeniv.linux.org.uk> Date: Wed Mar 5 19:22:23 2014 -0500 ceph_aio_read(): keep iov_iter across retries Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> commit 886a39115005ced8b15ab067c9c2a8d546b40a5e Author: Al Viro <viro@zeniv.linux.org.uk> Date: Wed Mar 5 13:50:45 2014 -0500 new primitive: iov_iter_alignment() returns the value aligned as badly as the worst remaining segment in iov_iter is. Use instead of open-coded equivalents. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> commit 26978b8b4d83c46f4310b253db70fa9e65149e7c Author: Al Viro <viro@zeniv.linux.org.uk> Date: Mon Mar 10 14:08:45 2014 -0400 give ->direct_IO() a copy of iov_iter the thing is, we want to advance what's given to ->direct_IO() as we are forming the request; however, the callers care about the amount of data actually transferred, not the amount we tried to transfer. It's more convenient to allow ->direct_IO() instances do use iov_iter_advance() on the copy of iov_iter, leaving the actual advancing of the original to caller. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> commit 31b140398ce56ab41646eda7f02bcb78d6a4c916 Author: Al Viro <viro@zeniv.linux.org.uk> Date: Wed Mar 5 01:33:16 2014 -0500 switch {__,}blockdev_direct_IO() to iov_iter Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> commit a6cbcd4a4a85e2fdb0b3344b88df2e8b3d526b9e Author: Al Viro <viro@zeniv.linux.org.uk> Date: Tue Mar 4 22:38:00 2014 -0500 get rid of pointless iov_length() in ->direct_IO() all callers have iov_length(iter->iov, iter->nr_segs) == iov_iter_count(iter) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> commit 16b1f05d7f5ab4ce570963aca5f3b2b5d21822fa Author: Al Viro <viro@zeniv.linux.org.uk> Date: Tue Mar 4 22:14:00 2014 -0500 ext4: switch the guts of ->direct_IO() to iov_iter Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> commit 619d30b4b8c488042b4a720ca79dccc346d1a516 Author: Al Viro <viro@zeniv.linux.org.uk> Date: Tue Mar 4 21:53:33 2014 -0500 convert the guts of nfs_direct_IO() to iov_iter Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> commit d8d3d94b80aa1a1c0ca75c58b8abdc7356f38418 Author: Al Viro <viro@zeniv.linux.org.uk> Date: Tue Mar 4 21:27:34 2014 -0500 pass iov_iter to ->direct_IO() unmodified, for now Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> commit cb66a7a1f149ff705fa37cad6d1252b046e0ad4f Author: Al Viro <viro@zeniv.linux.org.uk> Date: Tue Mar 4 15:24:06 2014 -0500 kill generic_segment_checks() all callers of ->aio_read() and ->aio_write() have iov/nr_segs already checked - generic_segment_checks() done after that is just an odd way to spell iov_length(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> commit 0ae5e4d370599592eab845527b31708a4f3411be Author: Al Viro <viro@zeniv.linux.org.uk> Date: Mon Mar 3 22:09:39 2014 -0500 __btrfs_direct_write(): switch to iov_iter Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> commit f8579f8673b7ecdb7a81d5d5bb1d981093d9aa94 Author: Al Viro <viro@zeniv.linux.org.uk> Date: Mon Mar 3 22:03:20 2014 -0500 generic_file_direct_write(): switch to iov_iter Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> commit e7c24607b5d68a4cdc56e09d70a3c8bae5f0519f Author: Al Viro <viro@zeniv.linux.org.uk> Date: Thu Apr 10 20:54:51 2014 -0400 kill iov_iter_copy_from_user() all callers can use copy_page_from_iter() and it actually simplifies them. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> commit f6c0a1920e0180175bd5e8e4aff8ea5556f1895d Author: Al Viro <viro@zeniv.linux.org.uk> Date: Wed Apr 23 10:18:46 2014 -0400 fs/file.c: don't open-code kvfree() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: 32-bit bug in iovec iterator changes 2014-06-21 3:51 ` 32-bit bug in iovec iterator changes Theodore Ts'o @ 2014-06-21 5:53 ` Al Viro 2014-06-21 23:09 ` Theodore Ts'o 0 siblings, 1 reply; 21+ messages in thread From: Al Viro @ 2014-06-21 5:53 UTC (permalink / raw) To: Theodore Ts'o, Dave Chinner, Jens Axboe, linux-kernel, linux-scsi On Fri, Jun 20, 2014 at 11:51:44PM -0400, Theodore Ts'o wrote: > On Fri, Jun 20, 2014 at 08:38:20AM +1000, Dave Chinner wrote: > > > > Short reads are more likely a bug in all the iovec iterator stuff > > that got merged in from the vfs tree. ISTR a 32 bit-only bug in that > > stuff go past in to do with not being able to partition a 32GB block > > dev on a 32 bit system due to a 32 bit size_t overflow somewhere.... > > Dave Chinner called it. > > Al, I'm seeing a regression which shows up using a 32-bit x86 kernel. > The symptoms of the bug is when run under KVM, with a 5 GB /dev/vdc > virtual block device, a read at offset 2 ** 30 fails with a short > read: > > # dd if=/dev/vdc of=/dev/null bs=4k skip=262144 count=1 > 0+0 records in > 0+0 records out > 0 bytes (0 B) copied, 0.0164144 s, 0.0 kB/s Argh... ed include/linux/uio.h <<EOF /iov_iter_truncate/s/size_t/u64/ w q EOF Could you check if that fixes the sucker? ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: 32-bit bug in iovec iterator changes 2014-06-21 5:53 ` Al Viro @ 2014-06-21 23:09 ` Theodore Ts'o 2014-06-21 23:49 ` Al Viro 0 siblings, 1 reply; 21+ messages in thread From: Theodore Ts'o @ 2014-06-21 23:09 UTC (permalink / raw) To: Al Viro; +Cc: Dave Chinner, Jens Axboe, linux-kernel, linux-scsi On Sat, Jun 21, 2014 at 06:53:07AM +0100, Al Viro wrote: > > ed include/linux/uio.h <<EOF > /iov_iter_truncate/s/size_t/u64/ > w > q > EOF > > Could you check if that fixes the sucker? The following patch (attached at the end) appears to fix the problem, but looking at uio.h, I'm completely confused about *why* it fixes the problem. In particular, iov_iter_iovec() makes no sense to me at all, and I don't understand how the calculation of iov_len makes any sense: .iov_len = min(iter->count, iter->iov->iov_len - iter->iov_offset), It also looks like uio.h is mostly about offsets to memory pointers, and so why this would make a difference when the issue is the block device offset goes above 2**30? There must be deep magic going on here, and so I don't know if your s/size_t/u64/g substitation also extends to the various functions that have size_t in them: unsigned long iov_shorten(struct iovec *iov, unsigned long nr_segs, size_t to); size_t iov_iter_copy_from_user_atomic(struct page *page, struct iov_iter *i, unsigned long offset, size_t bytes); void iov_iter_advance(struct iov_iter *i, size_t bytes); int iov_iter_fault_in_readable(struct iov_iter *i, size_t bytes); size_t iov_iter_single_seg_count(const struct iov_iter *i); size_t copy_page_to_iter(struct page *page, size_t offset, size_t bytes, struct iov_iter *i); size_t copy_page_from_iter(struct page *page, size_t offset, size_t bytes, struct iov_iter *i); unsigned long iov_iter_alignment(const struct iov_iter *i); void iov_iter_init(struct iov_iter *i, int direction, const struct iovec *iov, unsigned long nr_segs, size_t count); ssize_t iov_iter_get_pages(struct iov_iter *i, struct page **pages, size_t maxsize, size_t *start); ssize_t iov_iter_get_pages_alloc(struct iov_iter *i, struct page ***pages, size_t maxsize, size_t *start); Anyway, this patch does appear to make the problem go away, but given that I don't understand what is going on here, please take it with a huge grain of salt. And might I suggest some comments to perhaps give some context to someone who is trying to understand include/linux/uio.h? Thanks!! - Ted diff --git a/include/linux/uio.h b/include/linux/uio.h index e2231e4..bea7b7d 100644 --- a/include/linux/uio.h +++ b/include/linux/uio.h @@ -16,7 +16,7 @@ struct page; struct kvec { void *iov_base; /* and that should *never* hold a userland pointer */ - size_t iov_len; + u64 iov_len; }; enum { @@ -27,8 +27,8 @@ enum { struct iov_iter { int type; - size_t iov_offset; - size_t count; + u64 iov_offset; + u64 count; union { const struct iovec *iov; const struct bio_vec *bvec; @@ -41,12 +41,12 @@ struct iov_iter { * * NOTE that it is not safe to use this function until all the iovec's * segment lengths have been validated. Because the individual lengths can - * overflow a size_t when added together. + * overflow a u64 when added together. */ -static inline size_t iov_length(const struct iovec *iov, unsigned long nr_segs) +static inline u64 iov_length(const struct iovec *iov, unsigned long nr_segs) { unsigned long seg; - size_t ret = 0; + u64 ret = 0; for (seg = 0; seg < nr_segs; seg++) ret += iov[seg].iov_len; @@ -89,12 +89,12 @@ ssize_t iov_iter_get_pages_alloc(struct iov_iter *i, struct page ***pages, size_t maxsize, size_t *start); int iov_iter_npages(const struct iov_iter *i, int maxpages); -static inline size_t iov_iter_count(struct iov_iter *i) +static inline u64 iov_iter_count(struct iov_iter *i) { return i->count; } -static inline void iov_iter_truncate(struct iov_iter *i, size_t count) +static inline void iov_iter_truncate(struct iov_iter *i, u64 count) { if (i->count > count) i->count = count; @@ -104,7 +104,7 @@ static inline void iov_iter_truncate(struct iov_iter *i, size_t count) * reexpand a previously truncated iterator; count must be no more than how much * we had shrunk it. */ -static inline void iov_iter_reexpand(struct iov_iter *i, size_t count) +static inline void iov_iter_reexpand(struct iov_iter *i, u64 count) { i->count = count; } ^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: 32-bit bug in iovec iterator changes 2014-06-21 23:09 ` Theodore Ts'o @ 2014-06-21 23:49 ` Al Viro 2014-06-22 0:03 ` James Bottomley 0 siblings, 1 reply; 21+ messages in thread From: Al Viro @ 2014-06-21 23:49 UTC (permalink / raw) To: Theodore Ts'o, Dave Chinner, Jens Axboe, linux-kernel, linux-scsi On Sat, Jun 21, 2014 at 07:09:22PM -0400, Theodore Ts'o wrote: > On Sat, Jun 21, 2014 at 06:53:07AM +0100, Al Viro wrote: > > > > ed include/linux/uio.h <<EOF > > /iov_iter_truncate/s/size_t/u64/ > > w > > q > > EOF > > > > Could you check if that fixes the sucker? > > The following patch (attached at the end) appears to fix the problem, > but looking at uio.h, I'm completely confused about *why* it fixes the > problem. In particular, iov_iter_iovec() makes no sense to me at all, > and I don't understand how the calculation of iov_len makes any sense: > > .iov_len = min(iter->count, > iter->iov->iov_len - iter->iov_offset), Eh? We have iov[0].iov_base..iov[0].iov_base+iov[0].iov_len - 1 for area covered by the first iovec. First iov_offset bytes have already been consumed. And at most count bytes matter. So yes, this iov_len will give you equivalent first iovec. Said that, iov_iter_iovec() will die shortly - it's a rudiment of older code, with almost no users left. But yes, it is correct. > It also looks like uio.h is mostly about offsets to memory pointers, > and so why this would make a difference when the issue is the block > device offset goes above 2**30? It is, and your patch is a huge overkill. > There must be deep magic going on here, and so I don't know if your > s/size_t/u64/g substitation also extends to the various functions that > have size_t in them: No, it does not. It's specifically about iov_iter_truncate(); moreover, it matters to only one caller of that sucker. Namely, static ssize_t blkdev_read_iter(struct kiocb *iocb, struct iov_iter *to) { struct file *file = iocb->ki_filp; struct inode *bd_inode = file->f_mapping->host; loff_t size = i_size_read(bd_inode); loff_t pos = iocb->ki_pos; if (pos >= size) return 0; size -= pos; iov_iter_truncate(to, size); return generic_file_read_iter(iocb, to); } What happens here is capping to->count, to guarantee that we won't even look at anything past the end of block device. Alternative fix would be to have if (pos >= size) return 0; if (to->size + pos > size) { /* note that size - pos fits into size_t in this case, * so it's OK to pass it to iov_iter_truncate(). */ iov_iter_truncate(to, size - pos); } return generic_file_read_iter(iocb, to); in there. Other callers are passing it size_t values already, so we don't need similar checks there. Or we can make iov_iter_truncate() take an arbitrary u64 argument, seeing that it's inlined anyway. IMO it's more robust that way... Anyway, does the following alone fix the problem you are seeing? diff --git a/include/linux/uio.h b/include/linux/uio.h index ddfdb53..dbb02d4 100644 --- a/include/linux/uio.h +++ b/include/linux/uio.h @@ -94,7 +94,7 @@ static inline size_t iov_iter_count(struct iov_iter *i) return i->count; } -static inline void iov_iter_truncate(struct iov_iter *i, size_t count) +static inline void iov_iter_truncate(struct iov_iter *i, u64 count) { if (i->count > count) i->count = count; ^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: 32-bit bug in iovec iterator changes 2014-06-21 23:49 ` Al Viro @ 2014-06-22 0:03 ` James Bottomley 2014-06-22 0:26 ` Al Viro 0 siblings, 1 reply; 21+ messages in thread From: James Bottomley @ 2014-06-22 0:03 UTC (permalink / raw) To: Al Viro Cc: Theodore Ts'o, Dave Chinner, Jens Axboe, linux-kernel, linux-scsi On Sun, 2014-06-22 at 00:49 +0100, Al Viro wrote: > On Sat, Jun 21, 2014 at 07:09:22PM -0400, Theodore Ts'o wrote: > > On Sat, Jun 21, 2014 at 06:53:07AM +0100, Al Viro wrote: > > > > > > ed include/linux/uio.h <<EOF > > > /iov_iter_truncate/s/size_t/u64/ > > > w > > > q > > > EOF > > > > > > Could you check if that fixes the sucker? > > > > The following patch (attached at the end) appears to fix the problem, > > but looking at uio.h, I'm completely confused about *why* it fixes the > > problem. In particular, iov_iter_iovec() makes no sense to me at all, > > and I don't understand how the calculation of iov_len makes any sense: > > > > .iov_len = min(iter->count, > > iter->iov->iov_len - iter->iov_offset), > > Eh? We have iov[0].iov_base..iov[0].iov_base+iov[0].iov_len - 1 for > area covered by the first iovec. First iov_offset bytes have already > been consumed. And at most count bytes matter. So yes, this iov_len > will give you equivalent first iovec. > > Said that, iov_iter_iovec() will die shortly - it's a rudiment of older > code, with almost no users left. But yes, it is correct. > > > It also looks like uio.h is mostly about offsets to memory pointers, > > and so why this would make a difference when the issue is the block > > device offset goes above 2**30? > > It is, and your patch is a huge overkill. > > > There must be deep magic going on here, and so I don't know if your > > s/size_t/u64/g substitation also extends to the various functions that > > have size_t in them: > > No, it does not. It's specifically about iov_iter_truncate(); moreover, > it matters to only one caller of that sucker. Namely, > > static ssize_t blkdev_read_iter(struct kiocb *iocb, struct iov_iter *to) > { > struct file *file = iocb->ki_filp; > struct inode *bd_inode = file->f_mapping->host; > loff_t size = i_size_read(bd_inode); > loff_t pos = iocb->ki_pos; > > if (pos >= size) > return 0; > > size -= pos; > iov_iter_truncate(to, size); > return generic_file_read_iter(iocb, to); > } > > What happens here is capping to->count, to guarantee that we won't even look > at anything past the end of block device. Alternative fix would be to > have > if (pos >= size) > return 0; > if (to->size + pos > size) { > /* note that size - pos fits into size_t in this case, > * so it's OK to pass it to iov_iter_truncate(). > */ > iov_iter_truncate(to, size - pos); > } > return generic_file_read_iter(iocb, to); > in there. Other callers are passing it size_t values already, so we don't > need similar checks there. > > Or we can make iov_iter_truncate() take an arbitrary u64 argument, seeing that > it's inlined anyway. IMO it's more robust that way... > > Anyway, does the following alone fix the problem you are seeing? > > diff --git a/include/linux/uio.h b/include/linux/uio.h > index ddfdb53..dbb02d4 100644 > --- a/include/linux/uio.h > +++ b/include/linux/uio.h > @@ -94,7 +94,7 @@ static inline size_t iov_iter_count(struct iov_iter *i) > return i->count; > } > > -static inline void iov_iter_truncate(struct iov_iter *i, size_t count) > +static inline void iov_iter_truncate(struct iov_iter *i, u64 count) > { > if (i->count > count) > i->count = count; Al, how can that work? i->count is size_t, which is 32 bit, so we're going to get truncation errors. I could see this possibly working if count in struct iov_iter becomes u64 (which is going to have a lot of knock on consequences, but it seems to me that at least kvec.iov_len and iov_iter.iov_offset have to become u64 as well. James ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: 32-bit bug in iovec iterator changes 2014-06-22 0:03 ` James Bottomley @ 2014-06-22 0:26 ` Al Viro 2014-06-22 0:32 ` James Bottomley 0 siblings, 1 reply; 21+ messages in thread From: Al Viro @ 2014-06-22 0:26 UTC (permalink / raw) To: James Bottomley Cc: Theodore Ts'o, Dave Chinner, Jens Axboe, linux-kernel, linux-scsi On Sat, Jun 21, 2014 at 05:03:20PM -0700, James Bottomley wrote: > > Anyway, does the following alone fix the problem you are seeing? > > > > diff --git a/include/linux/uio.h b/include/linux/uio.h > > index ddfdb53..dbb02d4 100644 > > --- a/include/linux/uio.h > > +++ b/include/linux/uio.h > > @@ -94,7 +94,7 @@ static inline size_t iov_iter_count(struct iov_iter *i) > > return i->count; > > } > > > > -static inline void iov_iter_truncate(struct iov_iter *i, size_t count) > > +static inline void iov_iter_truncate(struct iov_iter *i, u64 count) > > { > > if (i->count > count) > > i->count = count; > > Al, how can that work? i->count is size_t, which is 32 bit, so we're > going to get truncation errors. No, we are not. Look: * comparison promotes both operands to u64 here, so its result is accurate, no matter how large count is. They are compared as natural numbers. * assignment converts count to size_t, which *would* truncate for values that are greater than the maximal value representable by size_t. But in that case it's by definition greater than i->count, so we do not reach that assignment at all. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: 32-bit bug in iovec iterator changes 2014-06-22 0:26 ` Al Viro @ 2014-06-22 0:32 ` James Bottomley 2014-06-22 0:53 ` Al Viro 0 siblings, 1 reply; 21+ messages in thread From: James Bottomley @ 2014-06-22 0:32 UTC (permalink / raw) To: Al Viro Cc: Theodore Ts'o, Dave Chinner, Jens Axboe, linux-kernel, linux-scsi On Sun, 2014-06-22 at 01:26 +0100, Al Viro wrote: > On Sat, Jun 21, 2014 at 05:03:20PM -0700, James Bottomley wrote: > > > > Anyway, does the following alone fix the problem you are seeing? > > > > > > diff --git a/include/linux/uio.h b/include/linux/uio.h > > > index ddfdb53..dbb02d4 100644 > > > --- a/include/linux/uio.h > > > +++ b/include/linux/uio.h > > > @@ -94,7 +94,7 @@ static inline size_t iov_iter_count(struct iov_iter *i) > > > return i->count; > > > } > > > > > > -static inline void iov_iter_truncate(struct iov_iter *i, size_t count) > > > +static inline void iov_iter_truncate(struct iov_iter *i, u64 count) > > > { > > > if (i->count > count) > > > i->count = count; > > > > Al, how can that work? i->count is size_t, which is 32 bit, so we're > > going to get truncation errors. > > No, we are not. Look: > * comparison promotes both operands to u64 here, so its result is > accurate, no matter how large count is. They are compared as natural > numbers. True ... figured this out 10 seconds after sending the email. > * assignment converts count to size_t, which *would* truncate for > values that are greater than the maximal value representable by size_t. > But in that case it's by definition greater than i->count, so we do not > reach that assignment at all. OK, so what I still don't get is why isn't the compiler warning when we truncate a u64 to a u32? We should get that warning in your new code, and we should have got that warning in fs/block_dev.c where it would have pinpointed the actual problem. James ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: 32-bit bug in iovec iterator changes 2014-06-22 0:32 ` James Bottomley @ 2014-06-22 0:53 ` Al Viro 2014-06-22 1:00 ` Al Viro 2014-06-22 1:00 ` 32-bit bug in iovec iterator changes James Bottomley 0 siblings, 2 replies; 21+ messages in thread From: Al Viro @ 2014-06-22 0:53 UTC (permalink / raw) To: James Bottomley Cc: Theodore Ts'o, Dave Chinner, Jens Axboe, linux-kernel, linux-scsi On Sat, Jun 21, 2014 at 05:32:44PM -0700, James Bottomley wrote: > > No, we are not. Look: > > * comparison promotes both operands to u64 here, so its result is > > accurate, no matter how large count is. They are compared as natural > > numbers. > > True ... figured this out 10 seconds after sending the email. > > > * assignment converts count to size_t, which *would* truncate for > > values that are greater than the maximal value representable by size_t. > > But in that case it's by definition greater than i->count, so we do not > > reach that assignment at all. > > OK, so what I still don't get is why isn't the compiler warning when we > truncate a u64 to a u32? We should get that warning in your new code, > and we should have got that warning in fs/block_dev.c where it would > have pinpointed the actual problem. In which universe? extern void f(unsigned int); void g(unsigned long x) { f(x); } is perfectly valid C, with no warnings in sight. f(1UL << 32) might give one, but not this... ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: 32-bit bug in iovec iterator changes 2014-06-22 0:53 ` Al Viro @ 2014-06-22 1:00 ` Al Viro 2014-06-22 11:50 ` Theodore Ts'o 2014-06-22 1:00 ` 32-bit bug in iovec iterator changes James Bottomley 1 sibling, 1 reply; 21+ messages in thread From: Al Viro @ 2014-06-22 1:00 UTC (permalink / raw) To: James Bottomley Cc: Theodore Ts'o, Dave Chinner, Jens Axboe, linux-kernel, linux-scsi On Sun, Jun 22, 2014 at 01:53:52AM +0100, Al Viro wrote: > On Sat, Jun 21, 2014 at 05:32:44PM -0700, James Bottomley wrote: > > > No, we are not. Look: > > > * comparison promotes both operands to u64 here, so its result is > > > accurate, no matter how large count is. They are compared as natural > > > numbers. > > > > True ... figured this out 10 seconds after sending the email. > > > > > * assignment converts count to size_t, which *would* truncate for > > > values that are greater than the maximal value representable by size_t. > > > But in that case it's by definition greater than i->count, so we do not > > > reach that assignment at all. > > > > OK, so what I still don't get is why isn't the compiler warning when we > > truncate a u64 to a u32? We should get that warning in your new code, > > and we should have got that warning in fs/block_dev.c where it would > > have pinpointed the actual problem. > > In which universe? > > extern void f(unsigned int); > > void g(unsigned long x) > { > f(x); > } > > is perfectly valid C, with no warnings in sight. f(1UL << 32) might > give one, but not this... PS: I agree that it's worth careful commenting, obviously, but before sending it to Linus (*with* comments) I want to get a confirmation that this one-liner actually fixes what Ted is seeing. I have reproduced it here, and that change makes the breakage go away in my testing, but I'd like to make sure that we are seeing the same thing. Ted? ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: 32-bit bug in iovec iterator changes 2014-06-22 1:00 ` Al Viro @ 2014-06-22 11:50 ` Theodore Ts'o 2014-06-23 7:44 ` [regression] fix 32-bit breakage in block device read(2) (was Re: 32-bit bug in iovec iterator changes) Al Viro 0 siblings, 1 reply; 21+ messages in thread From: Theodore Ts'o @ 2014-06-22 11:50 UTC (permalink / raw) To: Al Viro; +Cc: James Bottomley, Dave Chinner, Jens Axboe, linux-kernel, linux-scsi On Sun, Jun 22, 2014 at 02:00:32AM +0100, Al Viro wrote: > > PS: I agree that it's worth careful commenting, obviously, but > before sending it to Linus (*with* comments) I want to get a > confirmation that this one-liner actually fixes what Ted is seeing. > I have reproduced it here, and that change makes the breakage go > away in my testing, but I'd like to make sure that we are seeing the > same thing. Ted? Hep, that fixes things. Thanks for explaining what was going on! - Ted ^ permalink raw reply [flat|nested] 21+ messages in thread
* [regression] fix 32-bit breakage in block device read(2) (was Re: 32-bit bug in iovec iterator changes) 2014-06-22 11:50 ` Theodore Ts'o @ 2014-06-23 7:44 ` Al Viro 2014-06-23 15:43 ` Theodore Ts'o ` (2 more replies) 0 siblings, 3 replies; 21+ messages in thread From: Al Viro @ 2014-06-23 7:44 UTC (permalink / raw) To: Linus Torvalds, Theodore Ts'o, James Bottomley, Dave Chinner, Jens Axboe, linux-kernel, linux-scsi On Sun, Jun 22, 2014 at 07:50:07AM -0400, Theodore Ts'o wrote: > On Sun, Jun 22, 2014 at 02:00:32AM +0100, Al Viro wrote: > > > > PS: I agree that it's worth careful commenting, obviously, but > > before sending it to Linus (*with* comments) I want to get a > > confirmation that this one-liner actually fixes what Ted is seeing. > > I have reproduced it here, and that change makes the breakage go > > away in my testing, but I'd like to make sure that we are seeing the > > same thing. Ted? > > Hep, that fixes things. Thanks for explaining what was going on! OK, here it is, hopefully with sufficient comments: blkdev_read_iter() wants to cap the iov_iter by the amount of data remaining to the end of device. That's what iov_iter_truncate() is for (trim iter->count if it's above the given limit). So far, so good, but the argument of iov_iter_truncate() is size_t, so on 32bit boxen (in case of a large device) we end up with that upper limit truncated down to 32 bits *before* comparing it with iter->count. Easily fixed by making iov_iter_truncate() take 64bit argument - it does the right thing after such change (we only reach the assignment in there when the current value of iter->count is greater than the limit, i.e. for anything that would get truncated we don't reach the assignment at all) and that argument is not the new value of iter->count - it's an upper limit for such. The overhead of passing u64 is not an issue - the thing is inlined, so callers passing size_t won't pay any penalty. Reported-by: Theodore Tso <tytso@mit.edu> Tested-by: Theodore Tso <tytso@mit.edu> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- diff --git a/include/linux/uio.h b/include/linux/uio.h index ddfdb53..17ae7e3 100644 --- a/include/linux/uio.h +++ b/include/linux/uio.h @@ -94,8 +94,20 @@ static inline size_t iov_iter_count(struct iov_iter *i) return i->count; } -static inline void iov_iter_truncate(struct iov_iter *i, size_t count) +/* + * Cap the iov_iter by given limit; note that the second argument is + * *not* the new size - it's upper limit for such. Passing it a value + * greater than the amount of data in iov_iter is fine - it'll just do + * nothing in that case. + */ +static inline void iov_iter_truncate(struct iov_iter *i, u64 count) { + /* + * count doesn't have to fit in size_t - comparison extends both + * operands to u64 here and any value that would be truncated by + * conversion in assignement is by definition greater than all + * values of size_t, including old i->count. + */ if (i->count > count) i->count = count; } ^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [regression] fix 32-bit breakage in block device read(2) (was Re: 32-bit bug in iovec iterator changes) 2014-06-23 7:44 ` [regression] fix 32-bit breakage in block device read(2) (was Re: 32-bit bug in iovec iterator changes) Al Viro @ 2014-06-23 15:43 ` Theodore Ts'o 2014-06-24 12:33 ` One Thousand Gnomes 2014-06-25 16:56 ` Linus Torvalds 2014-06-26 15:27 ` Bruno Wolff III 2 siblings, 1 reply; 21+ messages in thread From: Theodore Ts'o @ 2014-06-23 15:43 UTC (permalink / raw) To: Al Viro Cc: Linus Torvalds, James Bottomley, Dave Chinner, Jens Axboe, linux-kernel, linux-scsi On Mon, Jun 23, 2014 at 08:44:40AM +0100, Al Viro wrote: > > OK, here it is, hopefully with sufficient comments: The comments look really good. I assume you'll get this to Linus in time for 3.16-rc3? Many thanks!! - Ted ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [regression] fix 32-bit breakage in block device read(2) (was Re: 32-bit bug in iovec iterator changes) 2014-06-23 15:43 ` Theodore Ts'o @ 2014-06-24 12:33 ` One Thousand Gnomes 0 siblings, 0 replies; 21+ messages in thread From: One Thousand Gnomes @ 2014-06-24 12:33 UTC (permalink / raw) To: Theodore Ts'o Cc: Al Viro, Linus Torvalds, James Bottomley, Dave Chinner, Jens Axboe, linux-kernel, linux-scsi On Mon, 23 Jun 2014 11:43:02 -0400 "Theodore Ts'o" <tytso@mit.edu> wrote: > On Mon, Jun 23, 2014 at 08:44:40AM +0100, Al Viro wrote: > > > > OK, here it is, hopefully with sufficient comments: > > The comments look really good. I assume you'll get this to > Linus in time for 3.16-rc3? Fixes the 32GB 'can't partition' bug I reported too. Alan ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [regression] fix 32-bit breakage in block device read(2) (was Re: 32-bit bug in iovec iterator changes) 2014-06-23 7:44 ` [regression] fix 32-bit breakage in block device read(2) (was Re: 32-bit bug in iovec iterator changes) Al Viro 2014-06-23 15:43 ` Theodore Ts'o @ 2014-06-25 16:56 ` Linus Torvalds 2014-06-26 15:27 ` Bruno Wolff III 2 siblings, 0 replies; 21+ messages in thread From: Linus Torvalds @ 2014-06-25 16:56 UTC (permalink / raw) To: Al Viro Cc: Theodore Ts'o, James Bottomley, Dave Chinner, Jens Axboe, Linux Kernel Mailing List, Linux SCSI List Al, just checking - did you expect me to take this from the email, or are you preparing a pull request? Linus On Mon, Jun 23, 2014 at 12:44 AM, Al Viro <viro@zeniv.linux.org.uk> wrote: > > OK, here it is, hopefully with sufficient comments: ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [regression] fix 32-bit breakage in block device read(2) (was Re: 32-bit bug in iovec iterator changes) 2014-06-23 7:44 ` [regression] fix 32-bit breakage in block device read(2) (was Re: 32-bit bug in iovec iterator changes) Al Viro 2014-06-23 15:43 ` Theodore Ts'o 2014-06-25 16:56 ` Linus Torvalds @ 2014-06-26 15:27 ` Bruno Wolff III 2 siblings, 0 replies; 21+ messages in thread From: Bruno Wolff III @ 2014-06-26 15:27 UTC (permalink / raw) To: Al Viro Cc: Linus Torvalds, Theodore Ts'o, James Bottomley, Dave Chinner, Jens Axboe, linux-kernel, linux-scsi On Mon, Jun 23, 2014 at 08:44:40 +0100, Al Viro <viro@ZenIV.linux.org.uk> wrote: > >blkdev_read_iter() wants to cap the iov_iter by the amount of >data remaining to the end of device. That's what iov_iter_truncate() >is for (trim iter->count if it's above the given limit). So far, >so good, but the argument of iov_iter_truncate() is size_t, so on >32bit boxen (in case of a large device) we end up with that upper >limit truncated down to 32 bits *before* comparing it with iter->count. This seems to fix a problem I had (https://bugzilla.kernel.org/show_bug.cgi?id=78711) with a partition device (/dev/sda3) being zero size on 3.16 kernels. However I am having some other issues with 3.16 on i686 and the amount of testing was the raid array using /dev/sda3 appeared to start (which it hadn't previously), but the system hung before finishing the boot process. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: 32-bit bug in iovec iterator changes 2014-06-22 0:53 ` Al Viro 2014-06-22 1:00 ` Al Viro @ 2014-06-22 1:00 ` James Bottomley 1 sibling, 0 replies; 21+ messages in thread From: James Bottomley @ 2014-06-22 1:00 UTC (permalink / raw) To: Al Viro Cc: Theodore Ts'o, Dave Chinner, Jens Axboe, linux-kernel, linux-scsi On Sun, 2014-06-22 at 01:53 +0100, Al Viro wrote: > On Sat, Jun 21, 2014 at 05:32:44PM -0700, James Bottomley wrote: > > > No, we are not. Look: > > > * comparison promotes both operands to u64 here, so its result is > > > accurate, no matter how large count is. They are compared as natural > > > numbers. > > > > True ... figured this out 10 seconds after sending the email. > > > > > * assignment converts count to size_t, which *would* truncate for > > > values that are greater than the maximal value representable by size_t. > > > But in that case it's by definition greater than i->count, so we do not > > > reach that assignment at all. > > > > OK, so what I still don't get is why isn't the compiler warning when we > > truncate a u64 to a u32? We should get that warning in your new code, > > and we should have got that warning in fs/block_dev.c where it would > > have pinpointed the actual problem. > > In which universe? > > extern void f(unsigned int); > > void g(unsigned long x) > { > f(x); > } In the one where the code is compiled with -Wconversion ... I'm just surprised, I thought we had this enabled. James ^ permalink raw reply [flat|nested] 21+ messages in thread
end of thread, other threads:[~2014-06-26 15:27 UTC | newest] Thread overview: 21+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2014-06-19 15:35 BUG: scheduling while atomic in blk_mq codepath? Theodore Ts'o 2014-06-19 15:59 ` Jens Axboe 2014-06-19 16:08 ` Theodore Ts'o 2014-06-19 16:21 ` Theodore Ts'o 2014-06-19 22:38 ` Dave Chinner 2014-06-21 3:51 ` 32-bit bug in iovec iterator changes Theodore Ts'o 2014-06-21 5:53 ` Al Viro 2014-06-21 23:09 ` Theodore Ts'o 2014-06-21 23:49 ` Al Viro 2014-06-22 0:03 ` James Bottomley 2014-06-22 0:26 ` Al Viro 2014-06-22 0:32 ` James Bottomley 2014-06-22 0:53 ` Al Viro 2014-06-22 1:00 ` Al Viro 2014-06-22 11:50 ` Theodore Ts'o 2014-06-23 7:44 ` [regression] fix 32-bit breakage in block device read(2) (was Re: 32-bit bug in iovec iterator changes) Al Viro 2014-06-23 15:43 ` Theodore Ts'o 2014-06-24 12:33 ` One Thousand Gnomes 2014-06-25 16:56 ` Linus Torvalds 2014-06-26 15:27 ` Bruno Wolff III 2014-06-22 1:00 ` 32-bit bug in iovec iterator changes James Bottomley
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).