* updated orangefs tree at kernel.org @ 2015-10-07 20:07 Mike Marshall 2015-10-08 21:07 ` Al Viro 0 siblings, 1 reply; 13+ messages in thread From: Mike Marshall @ 2015-10-07 20:07 UTC (permalink / raw) To: Al Viro; +Cc: Linus Torvalds, linux-fsdevel Greetings Al and fs-devel... The orangefs for-next tree over at kernel.org has been updated. It is rebased to 4.3-rc3 and has numerous new commits related to reviews received during the merge window... git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux.git for-next Martin Brandenburg, a Clemson student has worked on several of the issues pointed out during the merge window review, and Yi Liu, a Clemson sys-admin, is working on a patch that changes most occurrences of pvfs to orangefs. We hope to update our for-next branch one more time during this rc series with Yi's changes, and probably address an issue that the kbuild robots and other testers have reported since the for-next branch was updated. Arnd Bergmann even sent us a patch for the issue, that will probably be in the next update. Not only are more people helping and paying attention now, but there are some lurkers out there who are getting a lot of enjoyment out of the orangefs-related fs-devel thread from the merge window where I got some long emails from Linus with smiley-faces in them, and even longer emails from Al with cuss-word acronyms in them <g>... -Mike ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: updated orangefs tree at kernel.org 2015-10-07 20:07 updated orangefs tree at kernel.org Mike Marshall @ 2015-10-08 21:07 ` Al Viro 2015-10-08 23:03 ` Al Viro 0 siblings, 1 reply; 13+ messages in thread From: Al Viro @ 2015-10-08 21:07 UTC (permalink / raw) To: Mike Marshall; +Cc: Linus Torvalds, linux-fsdevel Something's fishy around pvfs2_inode_read(). Look: * ->readpage (== read_one_page) is called in normal context; no set_fs() in sight. * read_one_page() does kmap() of the page it had been given. Result is a kernel pointer. That pointer is (mis)annotated as userland one and passed to pvfs2_inode_read(). * pvfs2_inode_read() shoves it into struct iovec and passes that to wait_for_direct_io() (type == PVFS_IO_READ). From there it gets passed to postcopy_buffers(), where we do iov_iter_init(&iter, READ, vec, nr_segs, total_size); ret = pvfs_bufmap_copy_to_iovec(bufmap, &iter, buffer_index); which does copy_page_to_iter(). * after that iov_iter_init() we have iter.type equal to ITER_IOVEC | READ, so copy_page_to_iter() uses __copy_to_user_inatomic() or falls back to __copy_to_user(). Giving a kernel pointer to either is going to break on any architecture where separate MMU contexts are used for kernel and userland. On x86 we get away with that, but e.g. on sparc64 we won't. More to the point, why do you bother with kmap() at all? Just set a BVEC_ITER over that struct page and let copy_page_to_iter() do the right thing when it comes to it. IOW, have iov_iter passed to pvfs2_inode_read() / wait_for_direct_io() / postcopy_buffers() and be done with that - have read_one_page() do struct iov_iter iter; struct bvec bv = {.bv_page = page, .bv_len = PAGE_SIZE, .bv_offset = 0}; iov_iter_init_bvec(&iter, ITER_BVEC | READ, &bv, 1, PAGE_SIZE); max_block = ((inode->i_size / blocksize) + 1); if (page->index < max_block) { loff_t blockptr_offset = (((loff_t) page->index) << blockbits); bytes_read = pvfs2_inode_read(inode, &iter, blocksize, &blockptr_offset, inode->i_size); } /* will only zero remaining unread portions of the page data */ iov_iter_zero(~0U, &iter); and then proceed as you currently do, except that no kunmap() is needed at all. precopy_buffers() would also need to take iov_iter, and the other caller of wait_for_direct_io() (i.e. do_readv_writev()) will need to pass it an iov_iter as well. Said that, do_readv_writev() also needs some taking care of. Both callers are giving it iter->iov (which, BTW, is only valid for IOV_ITER ones), and then play very odd games with splitting iovecs if the total size exceeds your pvfs_bufmap_size_query(). What for? Sure, I understand not wanting to submit IO in chunks exceeding your limit. But why do you need a separate iovec each time? One thing that iov_iter does is handling the copying to/from partially used iovecs. copy_page_to_iter and copy_page_from_iter are just fine with that. IOW, why bother with split_iovecs() at all? wait_for_direct_io() uses 'vec' and 'nr_segs' only for passing them to {pre,post}copy_buffers(), where they are used to only to initialize iov_iter. And since copy_page_{to,from}_iter() advances the iterator, there's no need to reinitialize the damn thing at all. Just pass the amount to copy as an explicit argument, rather than using iov_iter_count() in pvfs_bufmap_copy_to_iovec(). Am I missing anything subtle in there? Looks like do_readv_writev() would get much simpler from that *and* ->read_iter()/->write_iter() will work on arbitrary iov_iter after that... ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: updated orangefs tree at kernel.org 2015-10-08 21:07 ` Al Viro @ 2015-10-08 23:03 ` Al Viro 2015-10-09 3:41 ` Al Viro 2015-10-10 2:31 ` Al Viro 0 siblings, 2 replies; 13+ messages in thread From: Al Viro @ 2015-10-08 23:03 UTC (permalink / raw) To: Mike Marshall; +Cc: Linus Torvalds, linux-fsdevel On Thu, Oct 08, 2015 at 10:07:45PM +0100, Al Viro wrote: > Am I missing anything subtle in there? Looks like do_readv_writev() would > get much simpler from that *and* ->read_iter()/->write_iter() will work > on arbitrary iov_iter after that... What I have in mind is something like (*ABSOLUTELY* *UNTESTED*) git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git orangefs-untested That's 8 commits on top of your #for-next: orangefs: explicitly pass the size to pvfs_bufmap_copy_to_iovec() pvfs_bufmap_copy_from_iovec(): don't rely upon size being equal to iov_iter_count(iter) orangefs: make postcopy_buffers() take iov_iter orangefs: make precopy_buffers() take iov_iter orangefs: make wait_for_direct_io() take iov_iter orangefs: don't bother with splitting iovecs orangefs: make do_readv_writev() take iov_iter orangefs: make pvfs2_inode_read() take iov_iter fs/orangefs/file.c | 344 +++++++------------------------------------------------------------------------------- fs/orangefs/inode.c | 17 ++--- fs/orangefs/pvfs2-bufmap.c | 52 ++++++------- fs/orangefs/pvfs2-bufmap.h | 3 +- fs/orangefs/pvfs2-kernel.h | 3 +- 5 files changed, 61 insertions(+), 358 deletions(-) Might take care of iov_iter-related issues (and get rid of arseloads of manual messing with iovecs, while we are at it). Again, this is really, really untested - might chew your data, etc. It compiles, but that's it. I easily might've missed something something subtle (or trivial, for that matter). And I haven't looked at the rest yet - some bugs I've mentioned are definitely still there (e.g. bogus iput() on failure of d_make_root() - it's already done by d_make_root() itself in that case, so what you are doing is double iput(); just remove it and be done with that), but I hadn't checked the whole list. Will post more later tonight... ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: updated orangefs tree at kernel.org 2015-10-08 23:03 ` Al Viro @ 2015-10-09 3:41 ` Al Viro 2015-10-09 6:13 ` Al Viro 2015-10-09 19:22 ` Al Viro 2015-10-10 2:31 ` Al Viro 1 sibling, 2 replies; 13+ messages in thread From: Al Viro @ 2015-10-09 3:41 UTC (permalink / raw) To: Mike Marshall; +Cc: Linus Torvalds, linux-fsdevel On Fri, Oct 09, 2015 at 12:03:33AM +0100, Al Viro wrote: > And I haven't looked at the rest yet - some bugs I've mentioned are definitely > still there (e.g. bogus iput() on failure of d_make_root() - it's already > done by d_make_root() itself in that case, so what you are doing is double > iput(); just remove it and be done with that), but I hadn't checked the > whole list. Will post more later tonight... Directories: 1) I don't understand what happens to position in directory (ctx->pos). AFAICS, you are reading directories in moderate chunks (up to 96 entries?) and use a token stored in *(file->private_data) to tell which chunk it is. So far, so good, but... AFAICS ctx->pos is reset to 0 on the beginning of each such chunk. Am I right assuming that looping through a directory with readdir(3) and watching for return values of telldir(3) will yield repeating offsets every time you go into the next chunk? 2) More seriously, rewinddir(3) *must* rewind the directory. With all Linux libc variants it means lseek() to 0; AFAICS, orangefs directories won't do it properly - all lseek attempts flat-out fail with ESPIPE. Userland won't be happy... 3) ->pvfs_dirent_outcount is u32 and it comes from server with no validation attempts ever made. Now, look at readdir->dirent_array = kmalloc(readdir->pvfs_dirent_outcount * sizeof(*readdir->dirent_array), in decode_dirents(). ->dirent_array is a structure consisting of pointer, int and a 16-byte 64bit-aligned array. IOW, it's size is greater than 1 and this product might wrap around. You really should use kcalloc() there. What's more, the loop parsing the response body is for (i = 0; i < readdir->pvfs_dirent_outcount; i++) { dec_string(pptr, &readdir->dirent_array[i].d_name, &readdir->dirent_array[i].d_length); readdir->dirent_array[i].khandle = *(struct pvfs2_khandle *) *pptr; *pptr += 16; } where dec_string(pptr, pbuf, plen) is defined as __u32 len = (*(__u32 *) *(pptr)); \ *pbuf = *(pptr) + 4; \ *(pptr) += roundup8(4 + len + 1); \ if (plen) \ *plen = len;\ Note that we do *not* validate len, so this code might end up dereferencing any address within +-2Gb from the buffer. You really can't assume that server is neither insane nor compromised; this is a blatant security hole. And please, drop these macros - you are not using enc_string() at all while dec_string() is used only in decode_dirents() and would be easier to understand if it had been spelled out right there. Especially since you need to add sanity checks (and buggering off should they fail) to it. 4) if (pos == 0) { ino = get_ino_from_khandle(dentry->d_inode); gossip_debug(GOSSIP_DIR_DEBUG, "%s: calling dir_emit of \".\" with pos = %llu\n", __func__, llu(pos)); ret = dir_emit(ctx, ".", 1, ino, DT_DIR); pos += 1; } in pvfs2_readdir() is bloody odd. What's wrong with emit_dot(), or, for that matter, dir_emit_dots() to cover both that and .. right after it. You *are* setting ->i_ino to the same hash of khandle you are using here anyway, so why not use the standard helpers? ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: updated orangefs tree at kernel.org 2015-10-09 3:41 ` Al Viro @ 2015-10-09 6:13 ` Al Viro 2015-10-09 19:22 ` Al Viro 1 sibling, 0 replies; 13+ messages in thread From: Al Viro @ 2015-10-09 6:13 UTC (permalink / raw) To: Mike Marshall; +Cc: Linus Torvalds, linux-fsdevel Superblock handling: 1) aside of the pointlessness of struct pvfs2_mount_sb_info_s, now that pvfs2_fill_sb() is called directly and isn't constrained to passing just one pointer argument, you are mishandling its failures. Note that mount_nodev() follows a failure of callback with deactivate_locked_super(s); pvfs2_mount() does not. It simply ends up leaking a struct super_block in such case. 2) ->kill_sb() is called for everything that had been created by sget(). IOW, your pvfs2_kill_sb() can be called for something that never got past the attempt to allocate ->s_fs_info. You seem to assume that it's only called for fully set up superblock. 3) the question about protection of pvfs2_superblocks in dispatch_ioctl_command() remains - handling of PVFS_DEV_REMOUNT_ALL loops through that list with no locks in common with the call chain leading through ->kill_sb() to remove_pvfs2_sb(). 4) ditto for pvfs2_remount() vs. pvfs2_unmount_sb() - is it OK to have the former called while the latter is running? I don't see anything that would provide an exclusion here. 5) are you sure that pvfs2_unmount_sb() should be done *before* kill_anon_super()? At that point we still might have dirty inodes in cache, etc. I don't know the protocol - can't tell if that's really OK. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: updated orangefs tree at kernel.org 2015-10-09 3:41 ` Al Viro 2015-10-09 6:13 ` Al Viro @ 2015-10-09 19:22 ` Al Viro 1 sibling, 0 replies; 13+ messages in thread From: Al Viro @ 2015-10-09 19:22 UTC (permalink / raw) To: Mike Marshall; +Cc: Linus Torvalds, linux-fsdevel On Fri, Oct 09, 2015 at 04:41:26AM +0100, Al Viro wrote: > where dec_string(pptr, pbuf, plen) is defined as > __u32 len = (*(__u32 *) *(pptr)); \ > *pbuf = *(pptr) + 4; \ > *(pptr) += roundup8(4 + len + 1); \ > if (plen) \ > *plen = len;\ > > Note that we do *not* validate len, so this code might end up dereferencing > any address within +-2Gb from the buffer. You really can't assume that > server is neither insane nor compromised; this is a blatant security hole. > > And please, drop these macros - you are not using enc_string() at all while > dec_string() is used only in decode_dirents() and would be easier to > understand if it had been spelled out right there. Especially since you > need to add sanity checks (and buggering off should they fail) to it. Actually, validation is bloody weak in pvfs2_devreq_writev() as well. And frankly, the layout it's expecting is downright insane. What you have is some initial segment of * 32bit protocol version. Never checked, which renders it useless. * 32bit magic. That one _is_ checked, so basically they work together as 64bit protocol version, with bloody odd validation. * 64bit tag, used to match with requests * 32bit type, apparently never checked - that of the matching request is used * 32bit status * 64bit trailer_size, apparently used only with readdir, so probably non-zero only if type is PVFS2_VFS_OP_READDIR or PVFS2_VFS_OP_READDIRPLUS (the latter doesn't seem to be ever issued, though). * pointer-sized junk. Ignored. * big fat union *NOT* including anything for readdir possibly followed by readdir response. To make things even funnier, you require 4- or 5-element iovec array, the first 4 covering the aforementioned "some initial segment". The 5th is quietly ignored unless trailer_size is positive and status is zero. If trailer_size > 0 && status == 0, you verify that the length of the 5th segment is no more than trailer_size and copy it to vmalloc'ed buffer. Without bothering to zero the rest of that buffer out. The member of that big fat union for getattr has a bit of additional insanity - several pointer-sized gaps. Entirely unused. Far be it from me to support The War On Some Drugs, but really, staying away from the stuff that induces trips _that_ bad is just plain common sense... First of all, this "exactly 4 or 5 iovecs in array" thing is beyond ugly - it's a way to separate large variable-length segment, all right, but why the hell not just declare e.g. that if request is at least 32 bytes long and has trailer_size > 0 && status == 0, trailer_size must be no more than request size - 32 and that much bytes in the end will go into vmalloc'ed buffer? Everything else must fit into MAX_ALIGNED_DEV_REQ_DOWNSIZE. AFAICS, that would be compatible with what your server is doing now. Next, and that can't be fixed without a protocol change, I'd suggest losing those pointer-sized gaps (and probably reordering struct PVFS_sys_attr_s to make sure that all u64 are naturally aligned there). Why make your protocol wordsize-sensitive, when it's trivial to avoid? In any case, the current code is asking for serious trouble if the 5th segment is there and _shorter_ than trailer_size. As the absolute minimum we need to zero the rest of vmalloc'ed buffer, and I seriously suspect that we need to outright reject such requests. And I would really prefer to get rid of this "exactly 4 or 5" thing as well (see above)... ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: updated orangefs tree at kernel.org 2015-10-08 23:03 ` Al Viro 2015-10-09 3:41 ` Al Viro @ 2015-10-10 2:31 ` Al Viro 2015-10-10 20:23 ` Al Viro 2015-10-10 23:10 ` orangefs: NAK until the ABI is documented (was Re: updated orangefs tree at kernel.org) Al Viro 1 sibling, 2 replies; 13+ messages in thread From: Al Viro @ 2015-10-10 2:31 UTC (permalink / raw) To: Mike Marshall; +Cc: Linus Torvalds, linux-fsdevel On Fri, Oct 09, 2015 at 12:03:33AM +0100, Al Viro wrote: > What I have in mind is something like (*ABSOLUTELY* *UNTESTED*) > git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git orangefs-untested OK, I've thrown several more fixes into that branch, but that's the low-hanging fruit. The really interesting questions I see right now are: * it needs some form of lseek() for directories - currently absent, not even rewind to the beginning. Userland won't be happy * races around the umount - I don't understand the protocol well enough to fix those, but this area looks fishy as hell * racing copy_attributes_to_inode() and the way it mangles the metadata of live inodes. ->i_mode updates in particular are seriously broken, so are updates of symlink bodies, etc. * truncation of long symlinks and the lack of NUL-termination in there. * atrocious userland API for downcalls (response to everything other than readdir must come in 4-element iovec array passed to writev(), no matter where the segment boundaries are; response to readdir must come in 5-element iovec array passed to writev(), the boundaries among the first 4 segments do not matter, the 5th segment covering exactly the payload). IMO that needs to be fixed before we merge the damn thing - it's really too ugly to live. I would really like to hear from somebody familiar with the userland side - what responses does it actually submit? The validation kernel-side is basically inexistent, and while I suspect that we could handle the actually sent responses much cleaner, the full set of everything that would be accepted by the current code is a bloody mess and would be much harder to handle in a clean way. What's more, the response layouts are messy, and if it is still possible to change that API, we'd be much better off if we cleaned them up. * for some reason the lookup request tells the server whether there was LOOKUP_FOLLOW in flags. This is really odd - no other filesystem cares about that bit and until now its presence in ->lookup() flags had been basically at convenience of fs/namei.c; it doesn't match anything useful and I'm very surprised by seeing orangefs pass it to server. LOOKUP_FOLLOW is almost certainly a wrong approximation to whatever orangefs is trying to achieve. What is it about? ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: updated orangefs tree at kernel.org 2015-10-10 2:31 ` Al Viro @ 2015-10-10 20:23 ` Al Viro 2015-10-10 23:10 ` orangefs: NAK until the ABI is documented (was Re: updated orangefs tree at kernel.org) Al Viro 1 sibling, 0 replies; 13+ messages in thread From: Al Viro @ 2015-10-10 20:23 UTC (permalink / raw) To: Mike Marshall; +Cc: Linus Torvalds, linux-fsdevel May I politely inquire about the intended meaning of the following definition? #define MAX_ALIGNED_DEV_REQ_DOWNSIZE \ (MAX_DEV_REQ_DOWNSIZE + \ ((((MAX_DEV_REQ_DOWNSIZE / \ (BITS_PER_LONG_DIV_8)) * \ (BITS_PER_LONG_DIV_8)) + \ (BITS_PER_LONG_DIV_8)) - \ MAX_DEV_REQ_DOWNSIZE)) 1) (x + (y - x)) is more commonly spelled as y 2) ((x / y) * y + y) might not be doing what you think it is doing 3) aligning the thing to multiple of sizeof(long) is an odd thing to do, seeing that the value being aligned is 16 + sizeof of a structure that contains pointers (which is an atrocity in its own right, BTW). 4) on the related note, what is the meaning of static int max_downsize = MAX_ALIGNED_DEV_REQ_DOWNSIZE; int ret = 0, num_remaining = max_downsize; (with no code other code touching that max_downsize thing)? 5) having collected the contents of the first 4 segments of your writev() argument and having verified that their total size does not exceed MAX_ALIGNED_DEV_REQ_DOWNSIZE, you proceed to do this: payload_size -= (2 * sizeof(__s32) + sizeof(__u64)); if (payload_size <= sizeof(struct pvfs2_downcall_s)) /* copy the passed in downcall into the op */ memcpy(&op->downcall, ptr, sizeof(struct pvfs2_downcall_s)); else gossip_debug(GOSSIP_DEV_DEBUG, "writev: Ignoring %d bytes\n", payload_size); upon the entry payload_size contains the amount of data collected. What is the intended meaning of the 'else' branch in there? Note that it *is* possible to hit, exactly because MAX_ALIGNED_DEV_REQ_DOWNSIZE is sizeof(long) greater than 16 + sizeof(struct pvfs2_downcall_s). What should happen if we do hit it? I do understand the "we shall whine" part, but then what? Leave the op->downcall uninitialized and proceed with whatever might've been there? That code does marshalling of userland-supplied data. Worse, the ABI is both undocumented and utterly... I believe "idiosyncratic" would be a printable term for that kind of thing. Sigh... ^ permalink raw reply [flat|nested] 13+ messages in thread
* orangefs: NAK until the ABI is documented (was Re: updated orangefs tree at kernel.org) 2015-10-10 2:31 ` Al Viro 2015-10-10 20:23 ` Al Viro @ 2015-10-10 23:10 ` Al Viro 2015-10-12 16:20 ` Mike Marshall 1 sibling, 1 reply; 13+ messages in thread From: Al Viro @ 2015-10-10 23:10 UTC (permalink / raw) To: Mike Marshall; +Cc: Linus Torvalds, linux-fsdevel On Sat, Oct 10, 2015 at 03:31:57AM +0100, Al Viro wrote: > * atrocious userland API for downcalls (response to everything > other than readdir must come in 4-element iovec array passed to writev(), > no matter where the segment boundaries are; response to readdir must come > in 5-element iovec array passed to writev(), the boundaries among the > first 4 segments do not matter, the 5th segment covering exactly the payload). > IMO that needs to be fixed before we merge the damn thing - it's really too > ugly to live. I would really like to hear from somebody familiar with the > userland side - what responses does it actually submit? The validation > kernel-side is basically inexistent, and while I suspect that we could handle > the actually sent responses much cleaner, the full set of everything that > would be accepted by the current code is a bloody mess and would be much > harder to handle in a clean way. What's more, the response layouts are > messy, and if it is still possible to change that API, we'd be much better off > if we cleaned them up. Another API bogosity: when the 5th segment is present, successful writev() returns the sum of sizes of the first 4. Userland side just ignores that - everything positive is treated as "everything's written". Another piece of fun: header too large by less than sizeof(long) => print a warning, leave op->downcall completely uninitialized and proceed. Below is what the kernel side is actually doing: * less than 4 segments or more than 5 => whine and fail * if concatenation of the first 4 segments is longer than 16 + sizeof(struct pvfs2_downcall_s) + sizeof(long) => whine and fail * if concatenation of the first 4 segments is longer than 16 + sizeof(struct pvfs2_downcall_s) by no more than sizeof(long) => whine and proceed with garbage. * if concatenation of the first 4 segments is no longer than 16 + sizeof(struct pvfs2_downcall_s), it is zero-padded to that size, then the first 16 bytes are interpreted as 32bit protocol version (ignored) + 32bit magic (checked for one specific value) + 64bit tag (used to find the matching request). The rest is copied into op->downcall. * if the 32bit value 4 bytes into op->downcall is zero and 64bit value following it is non-zero, the latter is interpreted as the size of trailer data. * if there's no trailer, the 5th segment (if present) is completely ignored. Otherwise it must be present and is expected to contain the trailer data. * if the 5th segment is longer than expected trailer data => whine and fail (note that if the amount of expected trailer data is zero a non-empty 5th segment is quietly ignored). * if trailer expected, vmalloc a buffer for it and copy the 5th segment contents in there; in case of the 5th segment being *shorter* than expected trailer data, leave the rest of the buffer uninitialized (i.e. expose the contents of random previously freed pages). * if vmalloc fails, act as if status (32bit at offset 5 into op->downcall) had been -ENOMEM and don't look at the 5th segment at all. * in all cases ignore the amount of data taken from the 5th segment; the return value is the sum of sizes of the first 4. What the current userland side appears to sends is 4-segment version/magic/tag/struct pvfs2_downcall_s + possibly the 5th segment covering the readdir results. Trailer size in struct pvfs2_downcall_s actually is equal to the size of the 5th segment if present and 0 otherwise. The 4th segment (struct pvfs2_downcall_s) contains a bunch of garbage never used by the kernel (pointers and misalignment gaps). How much of that is guaranteed is, of course, known only to the orangefs folks. I only looked at the current version of the userland code; hadn't checked the older ones and have no idea what kind of changes might be planned. Folks, userland interfaces need to be documented (and preferably - contain less ugliness). At that point I think that we need the damn API written up and reviewed, or it's a strong NAK. Once it's in the tree, it's cast in stone, so we'd better get it right and we need it explicitly documented. "Whatever orangefs userland code does" doesn't cut it, especially since your protocol version check is inexistent. Digging through the entire history of your userland code and trying to deduce what would and what would not be OK to change kernel-side is far past reasonable. I'm not asking for the description of semantics for individual requests and responses (it would be nice to have as well, but that's a separate story), but the rules for data marshalling are really needed. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: orangefs: NAK until the ABI is documented (was Re: updated orangefs tree at kernel.org) 2015-10-10 23:10 ` orangefs: NAK until the ABI is documented (was Re: updated orangefs tree at kernel.org) Al Viro @ 2015-10-12 16:20 ` Mike Marshall 2015-10-12 19:16 ` Al Viro 0 siblings, 1 reply; 13+ messages in thread From: Mike Marshall @ 2015-10-12 16:20 UTC (permalink / raw) To: Al Viro; +Cc: Linus Torvalds, linux-fsdevel > ... the ABI is ... utterly... "idiosyncratic"... Al, do you mean idiosyncratic in a "distinctive and special" way, or a "strange and bizarre" way? <g> When I first started working with the kernel module, I grew to think of the service_operation as the event horizon to a black hole, but I figured that was just me... First off, Al, I want to thank you for spending so much time on the review. You have pointed out numerous ways to make the kernel module better. Most importantly, though, I believe you are saying (and I'm purposely restating your words here to make sure we're on the same page) that there will be no ack until we provide a usable detailed definition of what is in every upcall and downcall and ioctl between the kernel module and the userspace daemon. Not only do we need, for example, to specify how LOOKUP_FOLLOW needs to be provided in a lookup request, but it would be nice to also say why. You're not promising an ack in return, but we can't go further until we provide the definition. > Once [the userspace interface is] in the tree, it's cast in > stone, so we'd better get it right and we need it explicitly > documented. Along these lines, we've done some well-intentioned "thrashing around" (not Linus' favorite development model) towards ensuring that an upstream kernel module will continue to work into the future. The "khandle" code is an attempt to ensure that things will continue to work when userspace transitions to 128 bit uuids for handles, instead of the current 64 bit handles - we use handles in the kernel module as inode numbers. And I added an ioctl that allows the kernel module to become aware of userspace's ever changing list of debug modes when the userspace daemon is first started. Our intent to keep the upstream kernel module working with userspace, current and future, is also reflected in the upstream kernel modules' handling of the the protocol version check - if userspace sees that the kernel module is the upstream version then charge ahead because we've planned not to change the protocol. Anyhow, Martin and I plan to work on the protocol definition, which may/might/probably-will lead to changes in the code to make the protocol more sane. This will probably lead to us missing the next merge window. And, as we proceed, we might ask for feedback on what we are doing from Al and fs-devel... Al, does this sound like how you think we should focus our energy at this point? -Mike On Sat, Oct 10, 2015 at 7:10 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: > On Sat, Oct 10, 2015 at 03:31:57AM +0100, Al Viro wrote: >> * atrocious userland API for downcalls (response to everything >> other than readdir must come in 4-element iovec array passed to writev(), >> no matter where the segment boundaries are; response to readdir must come >> in 5-element iovec array passed to writev(), the boundaries among the >> first 4 segments do not matter, the 5th segment covering exactly the payload). >> IMO that needs to be fixed before we merge the damn thing - it's really too >> ugly to live. I would really like to hear from somebody familiar with the >> userland side - what responses does it actually submit? The validation >> kernel-side is basically inexistent, and while I suspect that we could handle >> the actually sent responses much cleaner, the full set of everything that >> would be accepted by the current code is a bloody mess and would be much >> harder to handle in a clean way. What's more, the response layouts are >> messy, and if it is still possible to change that API, we'd be much better off >> if we cleaned them up. > > Another API bogosity: when the 5th segment is present, successful writev() > returns the sum of sizes of the first 4. Userland side just ignores that - > everything positive is treated as "everything's written". > > Another piece of fun: header too large by less than sizeof(long) => print > a warning, leave op->downcall completely uninitialized and proceed. > > Below is what the kernel side is actually doing: > * less than 4 segments or more than 5 => whine and fail > * if concatenation of the first 4 segments is longer than > 16 + sizeof(struct pvfs2_downcall_s) + sizeof(long) => whine and fail > * if concatenation of the first 4 segments is longer than > 16 + sizeof(struct pvfs2_downcall_s) by no more than sizeof(long) => whine > and proceed with garbage. > * if concatenation of the first 4 segments is no longer than > 16 + sizeof(struct pvfs2_downcall_s), it is zero-padded to that size, > then the first 16 bytes are interpreted as 32bit protocol version (ignored) > + 32bit magic (checked for one specific value) + 64bit tag (used to find the > matching request). The rest is copied into op->downcall. > * if the 32bit value 4 bytes into op->downcall is zero and 64bit > value following it is non-zero, the latter is interpreted as the size of > trailer data. > * if there's no trailer, the 5th segment (if present) is completely > ignored. Otherwise it must be present and is expected to contain the trailer > data. > * if the 5th segment is longer than expected trailer data => whine > and fail (note that if the amount of expected trailer data is zero a non-empty > 5th segment is quietly ignored). > * if trailer expected, vmalloc a buffer for it and copy the 5th > segment contents in there; in case of the 5th segment being *shorter* than > expected trailer data, leave the rest of the buffer uninitialized (i.e. > expose the contents of random previously freed pages). > * if vmalloc fails, act as if status (32bit at offset 5 into > op->downcall) had been -ENOMEM and don't look at the 5th segment at all. > * in all cases ignore the amount of data taken from the 5th segment; > the return value is the sum of sizes of the first 4. > > What the current userland side appears to sends is 4-segment > version/magic/tag/struct pvfs2_downcall_s + possibly the 5th segment covering > the readdir results. Trailer size in struct pvfs2_downcall_s actually is > equal to the size of the 5th segment if present and 0 otherwise. > The 4th segment (struct pvfs2_downcall_s) contains a bunch of garbage > never used by the kernel (pointers and misalignment gaps). > How much of that is guaranteed is, of course, known only to the > orangefs folks. I only looked at the current version of the userland code; > hadn't checked the older ones and have no idea what kind of changes might > be planned. > > Folks, userland interfaces need to be documented (and preferably - > contain less ugliness). At that point I think that we need the damn API > written up and reviewed, or it's a strong NAK. Once it's in the tree, it's > cast in stone, so we'd better get it right and we need it explicitly > documented. "Whatever orangefs userland code does" doesn't cut it, especially > since your protocol version check is inexistent. Digging through the entire > history of your userland code and trying to deduce what would and what would > not be OK to change kernel-side is far past reasonable. > > I'm not asking for the description of semantics for individual requests > and responses (it would be nice to have as well, but that's a separate story), > but the rules for data marshalling are really needed. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: orangefs: NAK until the ABI is documented (was Re: updated orangefs tree at kernel.org) 2015-10-12 16:20 ` Mike Marshall @ 2015-10-12 19:16 ` Al Viro 2015-10-13 17:46 ` Mike Marshall 0 siblings, 1 reply; 13+ messages in thread From: Al Viro @ 2015-10-12 19:16 UTC (permalink / raw) To: Mike Marshall; +Cc: Linus Torvalds, linux-fsdevel On Mon, Oct 12, 2015 at 12:20:42PM -0400, Mike Marshall wrote: > > ... the ABI is ... utterly... "idiosyncratic"... > > Al, do you mean idiosyncratic in a "distinctive and special" way, > or a "strange and bizarre" way? <g> The latter, but much stronger ;-) > When I first started working with the kernel module, I grew > to think of the service_operation as the event horizon to > a black hole, but I figured that was just me... *snort* Anything you get back is, by definition, from before the event horizon... > First off, Al, I want to thank you for spending so much > time on the review. You have pointed out numerous ways > to make the kernel module better. Most importantly, though, > I believe you are saying (and I'm purposely restating > your words here to make sure we're on the same page) that there > will be no ack until we provide a usable detailed definition of > what is in every upcall and downcall and ioctl between > the kernel module and the userspace daemon. Not only do we need, > for example, to specify how LOOKUP_FOLLOW needs to be provided > in a lookup request, but it would be nice to also say why. LOOKUP_FOLLOW is a separate story - the issue here is that you are exposing internal details of the pathname resolution to the userland and that's essentially a demand to keep them stable from now on; however, I'm seriously at loss as to _what_ to keep stable. "This call of ->lookup() gets LOOKUP_FOLLOW in flags" doesn't map to any natural semantics I can think of; it might accidentally happen to match something useful, but I don't see what it is. Your userland code does something rather odd with it, but it's buried inside your state machine and I don't understand it well enough to tell what's going on. Is it trying to follow symlinks on its own upon lookup? If so, it's _very_ bogus. If nothing else, userland code doesn't have enough information to do that - it has no idea what the originating process is chrooted to, for starters. As the result, having your filesystem mounted inside a chroot jail would open a nice way for non-priveleged process to break out - just create an absolute symlink somewhere on that filesystem and go through it. Again, I'm not sure what that userland code is doing, so it (hopefully) might be something completely different. So yes, I really do want details on this one, but for a different reason. What I want from API documentation is something along the lines of "writev() on /dev/pvfs2-req is used to pass responses to the requests made by the kernel side; the data structure being passed is represented in the following way <...>; representations do not statisfying constraints <...> are to be rejected". What you do with the payload _after_ it's been transferred is a separate story; it's also nice to have, but right now we don't even have the calling conventions, nevermind the description of behaviour. How is one supposed to review the marshalling code? Sure, I can tell that if you have the iov[4].iov_len greater than .trailer_size, you end up with uninitialized tail of vmalloc'ed buffer. I can also see that your userland code doesn't ever pass a positive .trailer_size with iov[4].iov_len different from it. What's the proper fix? To declare that they must always be equal, and reject mismatches, as you already do for inequality in other direction? Or should we quietly zero-pad the buffer? Either will work for your current userland code, but only you can tell which is the right fix - I don't know what the older versions used to do and I don't know what changes are planned. And yes, I understand the desire to separate a potentially large payload kernel-side; is there any deep reason why one doesn't simply use .trailer_size to decide how much in the end of the area being transferred should go there? Are there any plans for having the first 4 segments shorter than the full 16 + sizeof(struct pvfs2_downcall_s)? If so, what's the minimal amount to be expected in all cases? I'm not nitpicking here - having e.g. short packets ending at (non-zero) .status would work with the current kernel code and isn't inconceivable as a planned change. For that matter, what's .type for? It looks like a selector for that union you have in the end, but it is not used that way - originating upcall is used for that instead. Is it just a junk, or is it a currently unimplemented sanity check (if .type in downcall must be the same as in matching upcall)? > You're not promising an ack in return, but we can't go further > until we provide the definition. *shrug* Sure, I can try and pull it out of you guys bit by bit instead (see above for some initial questions), but I suspect that it would be faster and less painful for everyone involved if you just describe the calling conventions and we'll see what might be missing. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: orangefs: NAK until the ABI is documented (was Re: updated orangefs tree at kernel.org) 2015-10-12 19:16 ` Al Viro @ 2015-10-13 17:46 ` Mike Marshall 2015-10-13 23:34 ` Al Viro 0 siblings, 1 reply; 13+ messages in thread From: Mike Marshall @ 2015-10-13 17:46 UTC (permalink / raw) To: Al Viro; +Cc: Linus Torvalds, linux-fsdevel > *shrug* Al, I guess what I wrote sounded pi**y, but I promise my attitude is 180 degrees away from that... I'm just trying to focus on what is most important... you've identified numerous issues, all of which need to be addressed, but the state of the API used by userspace is a show stopper: it needs documented, and then probably improved... When we have something coherent started, we'll send it your way so you can help make sure we're on the right track... Thanks! -Mike On Mon, Oct 12, 2015 at 3:16 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: > On Mon, Oct 12, 2015 at 12:20:42PM -0400, Mike Marshall wrote: >> > ... the ABI is ... utterly... "idiosyncratic"... >> >> Al, do you mean idiosyncratic in a "distinctive and special" way, >> or a "strange and bizarre" way? <g> > > The latter, but much stronger ;-) > >> When I first started working with the kernel module, I grew >> to think of the service_operation as the event horizon to >> a black hole, but I figured that was just me... > > *snort* > > Anything you get back is, by definition, from before the event horizon... > >> First off, Al, I want to thank you for spending so much >> time on the review. You have pointed out numerous ways >> to make the kernel module better. Most importantly, though, >> I believe you are saying (and I'm purposely restating >> your words here to make sure we're on the same page) that there >> will be no ack until we provide a usable detailed definition of >> what is in every upcall and downcall and ioctl between >> the kernel module and the userspace daemon. Not only do we need, >> for example, to specify how LOOKUP_FOLLOW needs to be provided >> in a lookup request, but it would be nice to also say why. > > LOOKUP_FOLLOW is a separate story - the issue here is that you are > exposing internal details of the pathname resolution to the userland > and that's essentially a demand to keep them stable from now on; however, > I'm seriously at loss as to _what_ to keep stable. "This call of ->lookup() > gets LOOKUP_FOLLOW in flags" doesn't map to any natural semantics I can > think of; it might accidentally happen to match something useful, but I > don't see what it is. > > Your userland code does something rather odd with it, but it's buried inside > your state machine and I don't understand it well enough to tell what's going > on. Is it trying to follow symlinks on its own upon lookup? If so, it's > _very_ bogus. If nothing else, userland code doesn't have enough information > to do that - it has no idea what the originating process is chrooted to, > for starters. As the result, having your filesystem mounted inside a chroot > jail would open a nice way for non-priveleged process to break out - just > create an absolute symlink somewhere on that filesystem and go through it. > Again, I'm not sure what that userland code is doing, so it (hopefully) > might be something completely different. > > So yes, I really do want details on this one, but for a different reason. > > What I want from API documentation is something along the lines of "writev() > on /dev/pvfs2-req is used to pass responses to the requests made by the > kernel side; the data structure being passed is represented in the following > way <...>; representations do not statisfying constraints <...> are to be > rejected". > > What you do with the payload _after_ it's been transferred is a separate > story; it's also nice to have, but right now we don't even have the calling > conventions, nevermind the description of behaviour. > > How is one supposed to review the marshalling code? Sure, I can tell that > if you have the iov[4].iov_len greater than .trailer_size, you end up with > uninitialized tail of vmalloc'ed buffer. I can also see that your userland > code doesn't ever pass a positive .trailer_size with iov[4].iov_len different > from it. What's the proper fix? To declare that they must always be equal, > and reject mismatches, as you already do for inequality in other direction? > Or should we quietly zero-pad the buffer? Either will work for your current > userland code, but only you can tell which is the right fix - I don't know > what the older versions used to do and I don't know what changes are planned. > > And yes, I understand the desire to separate a potentially large payload > kernel-side; is there any deep reason why one doesn't simply use .trailer_size > to decide how much in the end of the area being transferred should go there? > > Are there any plans for having the first 4 segments shorter than the full > 16 + sizeof(struct pvfs2_downcall_s)? If so, what's the minimal amount to > be expected in all cases? I'm not nitpicking here - having e.g. short > packets ending at (non-zero) .status would work with the current kernel > code and isn't inconceivable as a planned change. > > For that matter, what's .type for? It looks like a selector for that > union you have in the end, but it is not used that way - originating > upcall is used for that instead. Is it just a junk, or is it a currently > unimplemented sanity check (if .type in downcall must be the same as in > matching upcall)? > >> You're not promising an ack in return, but we can't go further >> until we provide the definition. > > *shrug* > > Sure, I can try and pull it out of you guys bit by bit instead (see above > for some initial questions), but I suspect that it would be faster and less > painful for everyone involved if you just describe the calling conventions > and we'll see what might be missing. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: orangefs: NAK until the ABI is documented (was Re: updated orangefs tree at kernel.org) 2015-10-13 17:46 ` Mike Marshall @ 2015-10-13 23:34 ` Al Viro 0 siblings, 0 replies; 13+ messages in thread From: Al Viro @ 2015-10-13 23:34 UTC (permalink / raw) To: Mike Marshall; +Cc: Linus Torvalds, linux-fsdevel On Tue, Oct 13, 2015 at 01:46:54PM -0400, Mike Marshall wrote: > > *shrug* > > Al, I guess what I wrote sounded pi**y, but I promise my attitude > is 180 degrees away from that... I'm just trying to focus on what is > most important... you've identified numerous issues, all of which > need to be addressed, but the state of the API used by userspace > is a show stopper: it needs documented, and then probably improved... > > When we have something coherent started, we'll send it your way > so you can help make sure we're on the right track... No problem... BTW, one note regarding the iov_iter - unlike an array of iovec, things like "copy <that many> bytes" are easy; you don't need the thing to start on the iovec boundary or anything like that. Simple copy_from_iter(dest, size, iter) will take care of everything. Something like struct { __u32 version; __u32 magic; __u64 tag; } head; total = iov_iter_count(iter); if (total < 24) short packet n = copy_from_iter(&head, 16, iter); if (n != 16) failed copy <check head.magic, find the matching upcall by head.tag> if (not found) stray response /* get the beginning of response proper */ memset(op->downcall, 0, 16); n = copy_from_iter(&op->downcall, 16, iter); if (n < 16) { /* it might be really short... */ if (iov_iter_count()) failed copy /* yes, and that's too short to have a trailer */ /* just zero-pad and that's it */ memset((char *)&op->downcall + n, 0, sizeof(op->downcall) - n); goto done; } /* will there be a trailer? */ trailer_count = 0; if (op->downcall.status == 0 && op->downcall.trailer_count) { trailer_count = op->downcall.trailer_count; if (total - 32 < trailer_count) packet too short buf = vmalloc(trailer_count) if (!buf) op->downcall.status = -ENOMEM; } n = total - 32 - trailer_count; if (sizeof(op->downcall) - 16 < n) packet too long if (copy_from_iter((char *)op->downcall + 16, 0, n) != n) failed copy memset((char *)op->downcall + 16 + n, 0, sizeof(op->downcall) - n - 16); if (trailer_count) { /* read or skip the trailer */ if (!buf) { iov_iter_advance(trailer_count, iter); } else { n = copy_from_iter(buf, trailer_count, iter); if (n < trailer_count) failed copy } } done: ... return total; would do marshalling and accept all cases where your current code doesn't produce an obvious breakage. It's a lot more flexible than your current *userland* code needs and probably more flexible than it would make sense to have; e.g. if packets are not allowed to be just 24 bytes long, we could turn that if (n < 16) {...} into if (n != 16) failed copy, if we can always assume that packet is at least 16 + sizeof(op->downcall), if could be simplified even more, etc. There's no real benefit in using the boundary of the magic 5th segment to tell where the trailer starts. The above is just a demo, but hopefully it does demonstrate how to use those primitives. Basically, use copy_from_iter() as you would use read() on stdin when writing a userland filter, with iov_iter_advance() used to skip a given amount and iov_iter_count() telling how much input is left. ^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2015-10-13 23:34 UTC | newest] Thread overview: 13+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2015-10-07 20:07 updated orangefs tree at kernel.org Mike Marshall 2015-10-08 21:07 ` Al Viro 2015-10-08 23:03 ` Al Viro 2015-10-09 3:41 ` Al Viro 2015-10-09 6:13 ` Al Viro 2015-10-09 19:22 ` Al Viro 2015-10-10 2:31 ` Al Viro 2015-10-10 20:23 ` Al Viro 2015-10-10 23:10 ` orangefs: NAK until the ABI is documented (was Re: updated orangefs tree at kernel.org) Al Viro 2015-10-12 16:20 ` Mike Marshall 2015-10-12 19:16 ` Al Viro 2015-10-13 17:46 ` Mike Marshall 2015-10-13 23:34 ` Al Viro
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).