Date: Wed, 14 Sep 2016 22:34:58 +0100
From: Al Viro
To: Linus Torvalds
Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org
Subject: [RFC] writev() semantics with invalid iovec in the middle
Message-ID: <20160914213457.GG2356@ZenIV.linux.org.uk>

Right now writev() with a 3-iovec array that has an unmapped address in
the second element and total length less than PAGE_SIZE will write the
first segment and stop at that.  Among other things, it guarantees the
short copy, and I would rather have it yield a 0-byte write (and -EFAULT
as return value).

All POSIX has to say about that is this (in 2.3 Error Numbers):

    [EFAULT]
	Bad address. The system detected an invalid address in attempting
	to use an argument of a call. The reliable detection of this error
	cannot be guaranteed, and when not detected may result in the
	generation of a signal, indicating an address violation, which is
	sent to the process.

Note that an unmapped page in the middle of a range covered already can
lead to the same kind of short write - i.e. if we have

	p = mmap(0, 3*4096, PROT_READ, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
	munmap(p + 4096, 4096);
	fd = open("/tmp/foo", O_CREAT|O_TRUNC|O_RDWR, 0777);
	write(fd, p + 2048, 8192);

write() will yield -EFAULT, not 2Kb stored.
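For anyone who wants to run the write() case above rather than read it, here is a minimal standalone sketch.  The helper name write_across_hole(), its path/pos/err parameters and the use of sysconf() instead of a hard-coded 4096 are assumptions made for illustration, not part of the original snippet.  What it returns depends on the kernel it runs on, so no expected output is given.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE	/* MAP_ANONYMOUS under strict -std= modes */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the mail): map three pages, unmap the
 * middle one, then write() two pages' worth starting half a page into
 * the first page, so only the first page/2 source bytes are readable.
 * pos is the file offset to start at (0 matches the snippet above).
 * Returns what write() returned; *err gets errno on failure, else 0.
 */
static ssize_t write_across_hole(const char *path, off_t pos, int *err)
{
	long page = sysconf(_SC_PAGESIZE);
	ssize_t ret = -1;
	char *p;
	int fd;

	assert(page > 0);
	*err = 0;

	fd = open(path, O_CREAT | O_TRUNC | O_RDWR, 0600);
	if (fd < 0) {
		*err = errno;
		return -1;
	}

	p = mmap(NULL, 3 * page, PROT_READ, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
	if (p == MAP_FAILED) {
		*err = errno;
		close(fd);
		return -1;
	}

	if (munmap(p + page, page) < 0 || lseek(fd, pos, SEEK_SET) < 0) {
		*err = errno;
	} else {
		ret = write(fd, p + page / 2, 2 * page);
		if (ret < 0)
			*err = errno;
	}

	/* the middle page is already gone; unmapping it again is harmless */
	munmap(p, 3 * page);
	close(fd);
	return ret;
}
```

Calling write_across_hole("/tmp/foo", 0, &err) and printing the return value together with err is enough to see which way a given kernel goes: either -1 with EFAULT, or a short write of at most half a page.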
The same will happen with

	writev(fd, &(struct iovec){p + 2048, 8192}, 1);

BTW, adding

	lseek(fd, 2049, SEEK_SET);

before that write (or writev) will result in 2047 bytes being written by
the latter.

IOW, we do not try to squeeze every byte that can be squeezed out of the
buffer; generally, an unmapped address anywhere in PAGE_SIZE worth of data
that would go into the same page-aligned chunk of destination can result
in short write cut at the beginning of that chunk.  iovec boundaries act
as barriers to short writes, mostly by accident.

Do we need to preserve that special treatment of iovec boundaries?  I
would really like to get rid of that - the current behaviour is an easy
and reliable way to trigger a short copy case in ->write_end() and those
are fairly brittle.  Sure, we still need to cope with them, and I think
I've got all instances in the current mainline fixed, but they are often
suboptimal.

Objections?
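The arithmetic behind the -EFAULT vs. 2047 difference can be spelled out with a toy model.  This is a sketch of the rule as described above (one page-aligned destination chunk at a time, give up on a chunk if any source byte for it is unreadable), not kernel code; model_write() and its parameters are made up for illustration, and it only covers a single contiguous buffer.

```c
#include <assert.h>
#include <errno.h>

/*
 * Toy model, NOT the kernel implementation.
 *   page     - page size (power of two)
 *   pos      - starting file offset
 *   readable - number of source bytes readable before the first unmapped one
 *   len      - requested write length
 * Returns the number of bytes written, or -EFAULT if the very first chunk
 * already touches unreadable source memory.
 */
static long model_write(unsigned long page, unsigned long pos,
			unsigned long readable, unsigned long len)
{
	unsigned long written = 0;

	assert(page != 0 && (page & (page - 1)) == 0);

	while (written < len) {
		/* bytes left in the current page-aligned destination chunk */
		unsigned long chunk = page - (pos + written) % page;

		if (chunk > len - written)
			chunk = len - written;
		/* any unreadable byte in this chunk: stop at its beginning */
		if (written + chunk > readable)
			break;
		written += chunk;
	}

	if (written == 0 && len > 0)
		return -EFAULT;
	return (long)written;
}
```

With page = 4096, readable = 2048 and len = 8192, starting at pos 0 the first chunk wants 4096 source bytes, so nothing is written and the model returns -EFAULT.  Starting at pos 2049 the first chunk is only 4096 - 2049 = 2047 bytes, which fits in the readable part; the next chunk does not, so the model returns 2047.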