* Re: [PATCH] zerocopy NFS updated
       [not found] <20020410.190550.83626375.taka@valinux.co.jp.suse.lists.linux.kernel>
@ 2002-04-10 19:32 ` Andi Kleen
  2002-04-11  2:30   ` David S. Miller
  0 siblings, 1 reply; 54+ messages in thread
From: Andi Kleen @ 2002-04-10 19:32 UTC (permalink / raw)
  To: Hirokazu Takahashi; +Cc: linux-kernel

Hirokazu Takahashi <taka@valinux.co.jp> writes:

> But I wonder about sendpage. I guess the HW IP checksum of outgoing
> pages might be miscalculated, as the VFS can update them at any time.
> A new feature like a COW page cache should be added to the VM, and
> such pages should be duplicated in this case.

For hw checksums it should not be a problem. NICs usually load the
packet into their packet fifo, compute the checksum on the fly, and
then patch it into the header in the fifo before sending it out.
A NIC that did slow PCI bus mastering twice just to compute the
checksum would be very dumb and I doubt they exist (if they do, I bet
software checksumming would be faster on them). When the NIC only
accesses the memory once there is no race window.

-Andi

^ permalink raw reply	[flat|nested] 54+ messages in thread
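For reference, the checksum being offloaded here is the 16-bit
ones'-complement Internet checksum of RFC 1071. A minimal software
version -- a sketch of what csum_partial() computes, not the kernel's
optimized assembly -- might look like this:

    /* Illustrative ones'-complement Internet checksum (RFC 1071).
     * This is only to show what the NIC computes on the fly from
     * its fifo; the kernel's real csum_partial() is hand-tuned asm. */
    #include <stddef.h>
    #include <stdint.h>

    uint16_t inet_checksum(const void *data, size_t len)
    {
            const uint8_t *p = data;
            uint32_t sum = 0;

            while (len > 1) {               /* sum 16-bit words */
                    sum += (uint32_t)p[0] << 8 | p[1];
                    p += 2;
                    len -= 2;
            }
            if (len)                        /* odd trailing byte */
                    sum += (uint32_t)p[0] << 8;
            while (sum >> 16)               /* fold the carries back in */
                    sum = (sum & 0xffff) + (sum >> 16);
            return (uint16_t)~sum;          /* ones' complement of the sum */
    }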
* Re: [PATCH] zerocopy NFS updated
  2002-04-10 19:32 ` [PATCH] zerocopy NFS updated Andi Kleen
@ 2002-04-11  2:30   ` David S. Miller
  2002-04-11  6:46     ` Hirokazu Takahashi
  0 siblings, 1 reply; 54+ messages in thread
From: David S. Miller @ 2002-04-11  2:30 UTC (permalink / raw)
  To: ak; +Cc: taka, linux-kernel

   From: Andi Kleen <ak@suse.de>
   Date: 10 Apr 2002 21:32:22 +0200

   For hw checksums it should not be a problem. NICs usually load the
   packet into their packet fifo, compute the checksum on the fly, and
   then patch it into the header in the fifo before sending it out.
   A NIC that did slow PCI bus mastering twice just to compute the
   checksum would be very dumb and I doubt they exist (if they do, I
   bet software checksumming would be faster on them). When the NIC
   only accesses the memory once there is no race window.

Aha, but in the NFS case what if the page in the page cache gets
truncated from the file before the SKB is given to the card?
It would be quite easy to add such a test case to connectathon :-)

See, we hold a reference to the page in the SKB, but this only
guarantees that it cannot be freed up and reused for another purpose.
It does not prevent the page contents from being sent out long after
it is no longer a part of that file.

Samba has similar issues, which is why they only use sendfile() when
the client holds an oplock on the file. (Although the Samba issue is
that in the same packet they mention both the length of the file and
the contents.)

I'm still not 100% convinced this behavior would be illegal in the
NFS case; it needs more deep thought than I can provide right now.

^ permalink raw reply	[flat|nested] 54+ messages in thread
* Re: [PATCH] zerocopy NFS updated
  2002-04-11  2:30 ` David S. Miller
@ 2002-04-11  6:46   ` Hirokazu Takahashi
  2002-04-11  6:48     ` David S. Miller
  0 siblings, 1 reply; 54+ messages in thread
From: Hirokazu Takahashi @ 2002-04-11  6:46 UTC (permalink / raw)
  To: davem; +Cc: ak, linux-kernel

Hi,

Thank you for the replies.

ak> From: Andi Kleen <ak@suse.de>
ak> Date: 10 Apr 2002 21:32:22 +0200
ak>
ak> For hw checksums it should not be a problem. NICs usually load the
ak> packet into their packet fifo, compute the checksum on the fly, and
ak> then patch it into the header in the fifo before sending it out.
ak> A NIC that did slow PCI bus mastering twice just to compute the
ak> checksum would be very dumb and I doubt they exist (if they do, I
ak> bet software checksumming would be faster on them). When the NIC
ak> only accesses the memory once there is no race window.

davem> Aha, but in the NFS case what if the page in the page cache gets
davem> truncated from the file before the SKB is given to the card?
davem> It would be quite easy to add such a test case to connectathon :-)
davem>
davem> See, we hold a reference to the page in the SKB, but this only
davem> guarantees that it cannot be freed up and reused for another
davem> purpose. It does not prevent the page contents from being sent
davem> out long after it is no longer a part of that file.

I believe it's probably OK. We reasoned as follows.

Consider a knfsd that sends the data of file A using sendmsg():

 1. The knfsd copies the data of file A into an sk_buff.
 2. File A may be truncated after step 1.
 3. NFS clients receive the packets of file A, which is already
    truncated.

Next, consider a knfsd that sends the data of file A using sendpage():

 1. The knfsd grabs the pages of file A. (page_cache_get)
 2. File A may be truncated after step 1.
 3. The knfsd sends the pages.
 4. NFS clients receive the packets of file A, which is already
    truncated.

Is there any difference between them? This behavior is invisible to
NFS clients, I think.

davem> Samba has similar issues, which is why they only use sendfile()
davem> when the client holds an oplock on the file. (Although the Samba
davem> issue is that in the same packet they mention both the length of
davem> the file and the contents.)

NFSD is part of the kernel -- not a usermode process -- so NFSD can
arrange to avoid this kind of situation. The new zerocopy knfsd grabs
the pages of a file and its attributes in the same operation, so I
think no discrepancy between them would occur.

And yes, I know the pages being sent might be overwritten by another
process, but that can also happen on local filesystems: file data can
be updated while another process is reading the same file.

davem> I'm still not 100% convinced this behavior would be illegal in
davem> the NFS case; it needs more deep thought than I can provide
davem> right now.

I'm happy to talk to you.

Thank you,
Hirokazu Takahashi

^ permalink raw reply	[flat|nested] 54+ messages in thread
* Re: [PATCH] zerocopy NFS updated
  2002-04-11  6:46 ` Hirokazu Takahashi
@ 2002-04-11  6:48   ` David S. Miller
  2002-04-11  7:41     ` Hirokazu Takahashi
  2002-04-12 12:30     ` Hirokazu Takahashi
  0 siblings, 2 replies; 54+ messages in thread
From: David S. Miller @ 2002-04-11  6:48 UTC (permalink / raw)
  To: taka; +Cc: ak, linux-kernel

   From: Hirokazu Takahashi <taka@valinux.co.jp>
   Date: Thu, 11 Apr 2002 15:46:51 +0900 (JST)

   Consider a knfsd that sends the data of file A using sendmsg():

    1. The knfsd copies the data of file A into an sk_buff.
    2. File A may be truncated after step 1.
    3. NFS clients receive the packets of file A, which is already
       truncated.

   Next, consider a knfsd that sends the data of file A using
   sendpage():

    1. The knfsd grabs the pages of file A. (page_cache_get)
    2. File A may be truncated after step 1.
    3. The knfsd sends the pages.
    4. NFS clients receive the packets of file A, which is already
       truncated.

   Is there any difference between them? This behavior is invisible
   to NFS clients, I think.

Consider a truncate() that leaves 1 byte in that page. To handle
mmap()'s of this file the kernel will memset() the rest of the page
to zero.

Now, in the sendfile() case the NFS client sees a page filled mostly
with zeros instead of file contents.

In the sendmsg() knfsd case, the client sees something reasonable: he
will see something that was actually in the file at some point in
time. The sendfile() case sees pure garbage, contents that never were
in the file at any point in time.

We could make knfsd take the write semaphore on the inode until the
client is known to have gotten the packet, but that is the kind of
overhead we'd like to avoid.

^ permalink raw reply	[flat|nested] 54+ messages in thread
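The zero-fill behaviour described above is visible from userspace; a
minimal sketch (file name arbitrary, error handling omitted) showing
truncate() zeroing the tail of the partial page under an existing
mapping:

    /* Sketch: truncate() zero-fills the rest of the partial page so
     * mmap() readers cannot see stale data -- the same zeros a racing
     * sendfile() could end up putting on the wire. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
            int fd = open("/tmp/trunc-demo", O_RDWR | O_CREAT | O_TRUNC, 0600);
            char buf[4096];

            memset(buf, 'A', sizeof(buf));
            write(fd, buf, sizeof(buf));    /* one page of 'A's */

            char *map = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, 0);
            ftruncate(fd, 1);               /* leave 1 byte in the page */

            /* Bytes past the new EOF, within the same page, now read 0. */
            printf("map[0]=%d map[100]=%d\n", map[0], map[100]);
            return 0;
    }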
* Re: [PATCH] zerocopy NFS updated
  2002-04-11  6:48 ` David S. Miller
@ 2002-04-11  7:41   ` Hirokazu Takahashi
  2002-04-11  7:52     ` David S. Miller
  1 sibling, 1 reply; 54+ messages in thread
From: Hirokazu Takahashi @ 2002-04-11  7:41 UTC (permalink / raw)
  To: davem; +Cc: ak, linux-kernel

Hi,

davem> Consider a truncate() that leaves 1 byte in that page. To handle
davem> mmap()'s of this file the kernel will memset() the rest of the
davem> page to zero.
davem>
davem> Now, in the sendfile() case the NFS client sees a page filled
davem> mostly with zeros instead of file contents.

Hmmm... now I see it clearly.

davem> In the sendmsg() knfsd case, the client sees something
davem> reasonable: he will see something that was actually in the file
davem> at some point in time. The sendfile() case sees pure garbage,
davem> contents that never were in the file at any point in time.
davem>
davem> We could make knfsd take the write semaphore on the inode until
davem> the client is known to have gotten the packet, but that is the
davem> kind of overhead we'd like to avoid.

Yes, the write semaphore would be a good solution if the TCP/IP stack
never got stuck.

Now I wonder if we could put these pages in a COW mode: when some
process tries to update the pages, they would be duplicated. It's
easy to implement in write(), truncate() and so on, but mmap() is a
little difficult if there is no reverse mapping from a page to its
PTEs.

What do you think about this idea?

Regards,
Hirokazu Takahashi

^ permalink raw reply	[flat|nested] 54+ messages in thread
* Re: [PATCH] zerocopy NFS updated
  2002-04-11  7:41 ` Hirokazu Takahashi
@ 2002-04-11  7:52   ` David S. Miller
  2002-04-11 11:38     ` Hirokazu Takahashi
  2002-04-11 17:33     ` Benjamin LaHaise
  0 siblings, 2 replies; 54+ messages in thread
From: David S. Miller @ 2002-04-11  7:52 UTC (permalink / raw)
  To: taka; +Cc: ak, linux-kernel

   From: Hirokazu Takahashi <taka@valinux.co.jp>
   Date: Thu, 11 Apr 2002 16:41:34 +0900 (JST)

   Now I wonder if we could put these pages in a COW mode: when some
   process tries to update the pages, they would be duplicated. It's
   easy to implement in write(), truncate() and so on, but mmap() is
   a little difficult if there is no reverse mapping from a page to
   its PTEs.

   What do you think about this idea?

I think this idea has such high overhead that it is not even worth
consideration; consider SMP.

^ permalink raw reply	[flat|nested] 54+ messages in thread
* Re: [PATCH] zerocopy NFS updated
  2002-04-11  7:52 ` David S. Miller
@ 2002-04-11 11:38   ` Hirokazu Takahashi
  2002-04-11 11:36     ` David S. Miller
  1 sibling, 1 reply; 54+ messages in thread
From: Hirokazu Takahashi @ 2002-04-11 11:38 UTC (permalink / raw)
  To: davem; +Cc: ak, linux-kernel

Hi, David

davem> Now I wonder if we could put these pages in a COW mode: when
davem> some process tries to update the pages, they would be
davem> duplicated. It's easy to implement in write(), truncate() and
davem> so on, but mmap() is a little difficult if there is no reverse
davem> mapping from a page to its PTEs.
davem>
davem> What do you think about this idea?
davem>
davem> I think this idea has such high overhead that it is not even
davem> worth consideration; consider SMP.

Hmmm... if I were to implement it, how about the following code?

nfsd read()
{
        :
        page_cache_get(page);
        if (page is mapped anywhere)
                page = duplicate_and_rehash(page);  /* copy right away */
        else {
                lock_page(page);
                page->flags |= COW;                 /* mark: copy on write */
                UnlockPage(page);
        }
        sendpage(page);
        page_cache_release(page);
}

generic_file_write()
{
        page = __grab_cache_page();
        lock_page(page);
        if (page->flags & COW)
                page = duplicate_and_rehash(page);  /* writer gets a copy */
        prepare_write();
        commit_write();
        UnlockPage(page);
        page_cache_release(page);
}

truncate_list_pages()   /* <-- truncate() calls this */
{
        page_cache_get(page);
        lock_page(page);
        if (page->flags & COW)
                page = duplicate_and_rehash(page);  /* truncate the copy */
        truncate_partial_page();
        UnlockPage(page);
        page_cache_release(page);
}

^ permalink raw reply	[flat|nested] 54+ messages in thread
* Re: [PATCH] zerocopy NFS updated
  2002-04-11 11:38 ` Hirokazu Takahashi
@ 2002-04-11 11:36   ` David S. Miller
  2002-04-11 18:00     ` Denis Vlasenko
  0 siblings, 1 reply; 54+ messages in thread
From: David S. Miller @ 2002-04-11 11:36 UTC (permalink / raw)
  To: taka; +Cc: ak, linux-kernel

   From: Hirokazu Takahashi <taka@valinux.co.jp>
   Date: Thu, 11 Apr 2002 20:38:23 +0900 (JST)

   Hmmm... if I were to implement it, how about the following code?

   nfsd read()
   {
           :
           page_cache_get(page);
           if (page is mapped anywhere)
                   page = duplicate_and_rehash(page);
           else {
                   lock_page(page);
                   page->flags |= COW;
                   UnlockPage(page);
           }
           sendpage(page);
           page_cache_release(page);
   }

What if a process mmap's the page between duplicate_and_rehash and
the card actually getting the data?

This is hopeless. The whole COW idea is 1) expensive and 2) complex
to implement. This is why we don't implement sendfile with anything
other than a simple page reference; otherwise the overhead and
complexity are unacceptable.

No, you must block truncate operations on the file until the client
ACKs the nfsd read request if you wish to use sendfile() with nfsd.

^ permalink raw reply	[flat|nested] 54+ messages in thread
* Re: [PATCH] zerocopy NFS updated
  2002-04-11 11:36 ` David S. Miller
@ 2002-04-11 18:00   ` Denis Vlasenko
  2002-04-11 13:16     ` Andi Kleen
  0 siblings, 1 reply; 54+ messages in thread
From: Denis Vlasenko @ 2002-04-11 18:00 UTC (permalink / raw)
  To: David S. Miller, taka; +Cc: ak, linux-kernel

On 11 April 2002 09:36, David S. Miller wrote:
> No, you must block truncate operations on the file until the client
> ACKs the nfsd read request if you wish to use sendfile() with nfsd.

Which shouldn't be a big performance problem unless I am unaware of
some real-life applications doing heavy truncates.
--
vda

^ permalink raw reply	[flat|nested] 54+ messages in thread
* Re: [PATCH] zerocopy NFS updated
  2002-04-11 18:00 ` Denis Vlasenko
@ 2002-04-11 13:16   ` Andi Kleen
  2002-04-11 17:36     ` Benjamin LaHaise
  2002-04-16  0:17     ` Mike Fedyk
  0 siblings, 2 replies; 54+ messages in thread
From: Andi Kleen @ 2002-04-11 13:16 UTC (permalink / raw)
  To: Denis Vlasenko; +Cc: David S. Miller, taka, ak, linux-kernel

On Thu, Apr 11, 2002 at 04:00:37PM -0200, Denis Vlasenko wrote:
> On 11 April 2002 09:36, David S. Miller wrote:
> > No, you must block truncate operations on the file until the client
> > ACKs the nfsd read request if you wish to use sendfile() with nfsd.
>
> Which shouldn't be a big performance problem unless I am unaware of
> some real-life applications doing heavy truncates.

Every unlink does a truncate. There are applications that delete
files a lot.

-Andi

^ permalink raw reply	[flat|nested] 54+ messages in thread
* Re: [PATCH] zerocopy NFS updated
  2002-04-11 13:16 ` Andi Kleen
@ 2002-04-11 17:36   ` Benjamin LaHaise
  0 siblings, 0 replies; 54+ messages in thread
From: Benjamin LaHaise @ 2002-04-11 17:36 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel

On Thu, Apr 11, 2002 at 03:16:16PM +0200, Andi Kleen wrote:
> On Thu, Apr 11, 2002 at 04:00:37PM -0200, Denis Vlasenko wrote:
> > On 11 April 2002 09:36, David S. Miller wrote:
> > > No, you must block truncate operations on the file until the
> > > client ACKs the nfsd read request if you wish to use sendfile()
> > > with nfsd.
> >
> > Which shouldn't be a big performance problem unless I am unaware of
> > some real-life applications doing heavy truncates.
>
> Every unlink does a truncate. There are applications that delete
> files a lot.

Not quite. The implicit truncate only happens when the link count
falls to 0 and the last user of the inode releases their reference to
the inode.

-ben
--
"A man with a bass just walked in, and he's putting it down on the floor."

^ permalink raw reply	[flat|nested] 54+ messages in thread
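A small illustrative sketch of that distinction (file name arbitrary,
error handling omitted):

    /* Sketch: unlink() alone does not free the blocks -- the implicit
     * truncate happens only when the last reference (here, fd) to the
     * zero-link-count inode goes away. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
            int fd = open("/tmp/unlink-demo", O_RDWR | O_CREAT, 0600);
            char buf[4];

            write(fd, "data", 4);
            unlink("/tmp/unlink-demo");     /* link count drops to 0 ... */

            pread(fd, buf, 4, 0);           /* ... but the data is still there */
            printf("%.4s\n", buf);

            close(fd);      /* last reference gone: now the truncate runs */
            return 0;
    }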
* Re: [PATCH] zerocopy NFS updated
  2002-04-11 13:16 ` Andi Kleen
  2002-04-11 17:36   ` Benjamin LaHaise
@ 2002-04-16  0:17   ` Mike Fedyk
  2002-04-16 15:37     ` Oliver Xymoron
  1 sibling, 1 reply; 54+ messages in thread
From: Mike Fedyk @ 2002-04-16  0:17 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Denis Vlasenko, David S. Miller, taka, linux-kernel

On Thu, Apr 11, 2002 at 03:16:16PM +0200, Andi Kleen wrote:
> On Thu, Apr 11, 2002 at 04:00:37PM -0200, Denis Vlasenko wrote:
> > On 11 April 2002 09:36, David S. Miller wrote:
> > > No, you must block truncate operations on the file until the
> > > client ACKs the nfsd read request if you wish to use sendfile()
> > > with nfsd.
> >
> > Which shouldn't be a big performance problem unless I am unaware of
> > some real-life applications doing heavy truncates.
>
> Every unlink does a truncate. There are applications that delete
> files a lot.

Is this true at the filesystem level or only in memory? If so, I
could imagine that it would make it much harder to undelete a file
when you don't even know how big it was (file set to 0 size)...

Why is this required? Could someone explain quickly (as I'm sure it's
probably quite complex) or point me to some references?

^ permalink raw reply	[flat|nested] 54+ messages in thread
* Re: [PATCH] zerocopy NFS updated
  2002-04-16  0:17 ` Mike Fedyk
@ 2002-04-16 15:37   ` Oliver Xymoron
  0 siblings, 0 replies; 54+ messages in thread
From: Oliver Xymoron @ 2002-04-16 15:37 UTC (permalink / raw)
  To: Mike Fedyk
  Cc: Andi Kleen, Denis Vlasenko, David S. Miller, taka, linux-kernel

On Mon, 15 Apr 2002, Mike Fedyk wrote:

> Is this true at the filesystem level or only in memory? If so, I
> could imagine that it would make it much harder to undelete a file
> when you don't even know how big it was (file set to 0 size)...
>
> Why is this required? Could someone explain quickly (as I'm sure it's
> probably quite complex) or point me to some references?

Truncate is used to return the formerly used blocks to the free pool.
It is possible (and preferable) to avoid flushing out the modified
file metadata (inode and indirect blocks) for the deleted file, but
recoverability of deleted files has never been high on the priority
list.

--
"Love the dolphins," she advised him. "Write by W.A.S.T.E.."

^ permalink raw reply	[flat|nested] 54+ messages in thread
* Re: [PATCH] zerocopy NFS updated
  2002-04-11  7:52 ` David S. Miller
  2002-04-11 11:38   ` Hirokazu Takahashi
@ 2002-04-11 17:33   ` Benjamin LaHaise
  2002-04-12  8:10     ` Hirokazu Takahashi
  1 sibling, 1 reply; 54+ messages in thread
From: Benjamin LaHaise @ 2002-04-11 17:33 UTC (permalink / raw)
  To: David S. Miller; +Cc: taka, ak, linux-kernel

On Thu, Apr 11, 2002 at 12:52:16AM -0700, David S. Miller wrote:
> I think this idea has such high overhead that it is not even worth
> consideration; consider SMP.

One possibility is to make the inode semaphore a rwsem, and to have
NFS take that for read until the sendpage is complete. The idea of
splitting the inode semaphore up into two (one rw against truncate)
has been bounced around for a few other reasons (like allowing
multiple concurrent reads + writes to a file). Perhaps it's time to
bite the bullet and do it.

-ben
--
"A man with a bass just walked in, and he's putting it down on the floor."

^ permalink raw reply	[flat|nested] 54+ messages in thread
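A rough sketch of how such a split might look; the i_trunc_sem field
and its placement are invented here purely for illustration, and
holding it only covers queueing the SKB, not the DMA itself:

    /* Sketch only: a second, rw semaphore in the inode that serializes
     * truncate against in-flight sendpage().  Field name is made up. */

    /* knfsd read path: */
    down_read(&inode->i_trunc_sem);     /* block truncate, allow readers */
    err = sock->ops->sendpage(sock, page, offset, len, 0);
    up_read(&inode->i_trunc_sem);       /* ... until the send is queued */

    /* truncate path (e.g. in vmtruncate()): */
    down_write(&inode->i_trunc_sem);    /* wait out in-flight sends */
    truncate_inode_pages(inode->i_mapping, offset);
    up_write(&inode->i_trunc_sem);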
* Re: [PATCH] zerocopy NFS updated
  2002-04-11 17:33 ` Benjamin LaHaise
@ 2002-04-12  8:10   ` Hirokazu Takahashi
  0 siblings, 0 replies; 54+ messages in thread
From: Hirokazu Takahashi @ 2002-04-12  8:10 UTC (permalink / raw)
  To: bcrl; +Cc: davem, ak, linux-kernel

Hi,

Thank you for your suggestion.

bcrl> One possibility is to make the inode semaphore a rwsem, and to
bcrl> have NFS take that for read until the sendpage is complete. The
bcrl> idea of splitting the inode semaphore up into two (one rw
bcrl> against truncate) has been bounced around for a few other
bcrl> reasons (like allowing multiple concurrent reads + writes to a
bcrl> file). Perhaps it's time to bite the bullet and do it.

That sounds not so bad. Partial truncates would rarely happen, so it
might be enough. I'll give it a try.

Regards,
Hirokazu Takahashi

^ permalink raw reply	[flat|nested] 54+ messages in thread
* Re: [PATCH] zerocopy NFS updated
  2002-04-11  6:48 ` David S. Miller
  2002-04-11  7:41   ` Hirokazu Takahashi
@ 2002-04-12 12:30   ` Hirokazu Takahashi
  2002-04-12 12:35     ` Andi Kleen
  1 sibling, 1 reply; 54+ messages in thread
From: Hirokazu Takahashi @ 2002-04-12 12:30 UTC (permalink / raw)
  To: davem; +Cc: ak, linux-kernel

Hi,

I wondered if regular truncate() and read() might have the same
problem, so I tested again and again, and I realized it will occur on
any local filesystem: sometimes I could get partly zero-filled data
instead of file contents.

Analyzing the situation: the read system call doesn't lock anything
-- no page lock, no semaphore -- while someone truncates a file
partially. It happens most often when copy_user() takes a page fault
while copying file data to user space.

I guess that, if needed, this should be fixed in the VFS.

davem> Consider a truncate() that leaves 1 byte in that page. To handle
davem> mmap()'s of this file the kernel will memset() the rest of the
davem> page to zero.
davem>
davem> Now, in the sendfile() case the NFS client sees a page filled
davem> mostly with zeros instead of file contents.
davem>
davem> In the sendmsg() knfsd case, the client sees something
davem> reasonable: he will see something that was actually in the file
davem> at some point in time. The sendfile() case sees pure garbage,
davem> contents that never were in the file at any point in time.
davem>
davem> We could make knfsd take the write semaphore on the inode until
davem> the client is known to have gotten the packet, but that is the
davem> kind of overhead we'd like to avoid.

Thank you,
Hirokazu Takahashi.

^ permalink raw reply	[flat|nested] 54+ messages in thread
* Re: [PATCH] zerocopy NFS updated
  2002-04-12 12:30 ` Hirokazu Takahashi
@ 2002-04-12 12:35   ` Andi Kleen
  2002-04-12 21:22     ` Jamie Lokier
  2002-04-12 21:39     ` David S. Miller
  0 siblings, 2 replies; 54+ messages in thread
From: Andi Kleen @ 2002-04-12 12:35 UTC (permalink / raw)
  To: Hirokazu Takahashi; +Cc: davem, ak, linux-kernel

On Fri, Apr 12, 2002 at 09:30:11PM +0900, Hirokazu Takahashi wrote:
> Hi,
>
> I wondered if regular truncate() and read() might have the same
> problem, so I tested again and again, and I realized it will occur
> on any local filesystem: sometimes I could get partly zero-filled
> data instead of file contents.
>
> Analyzing the situation: the read system call doesn't lock anything
> -- no page lock, no semaphore -- while someone truncates a file
> partially. It happens most often when copy_user() takes a page fault
> while copying file data to user space.
>
> I guess that, if needed, this should be fixed in the VFS.

I don't see it as a big problem and would just leave it as it is (for
NFS and local). Adding more locking would slow down read() a lot, and
there should be a good reason to take such a performance hit. Linux
has done this forever and I don't think anybody ever reported it as a
bug, so we can probably safely assume that this behaviour (non-atomic
truncate) is not a problem for users in practice.

-Andi

^ permalink raw reply	[flat|nested] 54+ messages in thread
* Re: [PATCH] zerocopy NFS updated
  2002-04-12 12:35 ` Andi Kleen
@ 2002-04-12 21:22   ` Jamie Lokier
  2002-04-12 21:31     ` David S. Miller
  1 sibling, 1 reply; 54+ messages in thread
From: Jamie Lokier @ 2002-04-12 21:22 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Hirokazu Takahashi, davem, linux-kernel

Andi Kleen wrote:
> > I wondered if regular truncate() and read() might have the same
> > problem, so I tested again and again, and I realized it will occur
> > on any local filesystem: sometimes I could get partly zero-filled
> > data instead of file contents.
>
> I don't see it as a big problem and would just leave it as it is (for
> NFS and local). Adding more locking would slow down read() a lot, and
> there should be a good reason to take such a performance hit. Linux
> has done this forever and I don't think anybody ever reported it as a
> bug, so we can probably safely assume that this behaviour (non-atomic
> truncate) is not a problem for users in practice.

Ouch! I have a program which can output incorrect results if this is
the case. It may seem to use an esoteric locking strategy, but I had
no idea it was acceptable for read to return data that truncate is in
the middle of zeroing.

The program keeps a cache on disk of generated files. For each cached
object, there is a metadata file. Metadata files are text files,
written in such a way that the first line looks similar to "=12296.0"
and so does the last line, but none of the intermediate lines have
that form.

Multiple programs can access the disk cache at the same time, and
must be able to check the metadata files being written by other
programs, blocking if necessary while a cached object is being
generated.

When a metadata file is being written, first it is created, then the
cached object is created and written to a related file, and finally
the metadata including the first and last marker line is written to
the metadata file.

When a program reads a metadata file, it reads as much as it can and
then checks that the first and last lines are identical. If they are,
the middle of the file is valid; otherwise it isn't -- perhaps the
process generating that file died, or has the metadata file locked.
This strategy is used so that there's no need to lock the file in the
case of a cache hit with no complications. If there's a complication
we lock with LOCK_SH and try again.

From time to time, when a cache object is invalid, it's appropriate
to truncate the metadata file. If that were atomic, end of story.

Unfortunately I've just now heard that a read() can successfully
interleave some zeros from a parallel truncate. If the timing is
right, that means it's possible for the reading process to see a hole
of zeros in the middle of the file. The first and last lines would be
intact, and the reader would think that the whole file is therefore
valid. Bad!

This occurs if the reader copies the initial bytes from the page,
then the truncation process catches up and zeros out some bytes, but
then the reader catches up and beats the truncation process to the
end of the file.

I'm not advocating more locking in read() -- there's no need, and it
is quite important that it is fast! But I would very much appreciate
an understanding of the rules that relate reading, writing and
truncating processes. How much ordering & atomicity can I depend on?
Anything at all?

cheers,
-- Jamie

^ permalink raw reply	[flat|nested] 54+ messages in thread
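For concreteness, a sketch of the reader's validity check described
above (the buffer size, the single fread(), and the exact marker
format are simplifying assumptions):

    /* Sketch: the file is taken as valid only if its first and last
     * lines are identical "=..." marker lines. */
    #include <stdio.h>
    #include <string.h>

    int metadata_valid(const char *path)
    {
            char buf[65536];
            FILE *f = fopen(path, "r");
            size_t n;

            if (!f)
                    return 0;
            n = fread(buf, 1, sizeof(buf) - 1, f);
            fclose(f);
            buf[n] = '\0';

            char *first_nl = strchr(buf, '\n');
            if (!first_nl || buf[0] != '=')
                    return 0;

            /* find the start of the last non-empty line */
            char *p = buf + n;
            while (p > buf && p[-1] == '\n')
                    p--;
            while (p > buf && p[-1] != '\n')
                    p--;

            size_t len = first_nl - buf;
            return strncmp(buf, p, len) == 0
                && (p[len] == '\n' || p[len] == '\0');
    }

Note that, per the race described above, this check alone cannot
distinguish a valid file from one with a zero-filled hole in the
middle.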
* Re: [PATCH] zerocopy NFS updated
  2002-04-12 21:22 ` Jamie Lokier
@ 2002-04-12 21:31   ` David S. Miller
  2002-04-13  0:21     ` Jamie Lokier
  2002-04-13 18:52     ` Chris Wedgwood
  0 siblings, 2 replies; 54+ messages in thread
From: David S. Miller @ 2002-04-12 21:31 UTC (permalink / raw)
  To: lk; +Cc: ak, taka, linux-kernel

   From: Jamie Lokier <lk@tantalophile.demon.co.uk>
   Date: Fri, 12 Apr 2002 22:22:52 +0100

   I'm not advocating more locking in read() -- there's no need, and
   it is quite important that it is fast! But I would very much
   appreciate an understanding of the rules that relate reading,
   writing and truncating processes. How much ordering & atomicity
   can I depend on? Anything at all?

Basically none it appears :-)

If you need to depend upon a consistent snapshot of what some other
thread writes into a file, you must have some locking protocol to use
to synchronize with that other thread.

^ permalink raw reply	[flat|nested] 54+ messages in thread
* Re: [PATCH] zerocopy NFS updated
  2002-04-12 21:31 ` David S. Miller
@ 2002-04-13  0:21   ` Jamie Lokier
  2002-04-13  6:39     ` Andi Kleen
  1 sibling, 1 reply; 54+ messages in thread
From: Jamie Lokier @ 2002-04-13  0:21 UTC (permalink / raw)
  To: David S. Miller; +Cc: ak, taka, linux-kernel

David S. Miller wrote:
>    I'm not advocating more locking in read() -- there's no need, and
>    it is quite important that it is fast! But I would very much
>    appreciate an understanding of the rules that relate reading,
>    writing and truncating processes. How much ordering & atomicity
>    can I depend on? Anything at all?
>
> Basically none it appears :-)
>
> If you need to depend upon a consistent snapshot of what some other
> thread writes into a file, you must have some locking protocol to
> use to synchronize with that other thread.

Darn, I was hoping to avoid system calls. Perhaps it's good fortune
that futexes just arrived :-)

In some ways, it seems entirely reasonable for truncate() to behave
as if it were writing zeros. That is, after all, what you see there
if the file is expanded later with a hole.

I wonder if it is reasonable to depend on that: i.e., I'll only ever
see zeros, not say random bytes, or ones or something. I'm sure
that's so with the current kernel, and probably all of them ever
(except for bugs), but I wonder whether it's OK to rely on that.

-- Jamie

^ permalink raw reply	[flat|nested] 54+ messages in thread
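A small sketch showing the "holes read as zeros" behaviour referred
to above (file name arbitrary, error handling omitted):

    /* Sketch: extending a file leaves a hole, and the hole reads back
     * as zeros -- the same bytes a racing truncate() would expose. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
            int fd = open("/tmp/hole-demo", O_RDWR | O_CREAT | O_TRUNC, 0600);
            char c;

            write(fd, "x", 1);
            ftruncate(fd, 1 << 20);         /* extend: creates a hole */
            pread(fd, &c, 1, 4096);         /* read from inside the hole */
            printf("byte in hole: %d\n", c);  /* prints 0 */
            return 0;
    }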
* Re: [PATCH] zerocopy NFS updated
  2002-04-13  0:21 ` Jamie Lokier
@ 2002-04-13  6:39   ` Andi Kleen
  2002-04-13  8:01     ` Hirokazu Takahashi
  2002-04-13 19:19     ` Eric W. Biederman
  0 siblings, 2 replies; 54+ messages in thread
From: Andi Kleen @ 2002-04-13  6:39 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: David S. Miller, ak, taka, linux-kernel

> I wonder if it is reasonable to depend on that: i.e., I'll only ever
> see zeros, not say random bytes, or ones or something. I'm sure
> that's so with the current kernel, and probably all of them ever
> (except for bugs), but I wonder whether it's OK to rely on that.

With truncates you should only ever see zeros. If you want this
guarantee over system crashes you need to make sure to use the right
file system though (e.g. ext2, or reiserfs without the ordered data
mode patches, or ext3 in writeback mode could give you junk if the
system crashes at the wrong time). Still, depending on only seeing
zeroes seems a bit fragile to me (what happens when the disk dies,
for example?); using some other locking protocol is probably safer.

-Andi

^ permalink raw reply	[flat|nested] 54+ messages in thread
* Re: [PATCH] zerocopy NFS updated
  2002-04-13  6:39 ` Andi Kleen
@ 2002-04-13  8:01   ` Hirokazu Takahashi
  0 siblings, 0 replies; 54+ messages in thread
From: Hirokazu Takahashi @ 2002-04-13  8:01 UTC (permalink / raw)
  To: ak; +Cc: lk, davem, linux-kernel, Andrew Theurer

Hi,

Thanks to Andrew Theurer for his help. He posted the results of
testing my patches to the nfs@lists.sourceforge.net list; we get
great performance.

> I tried the patch with great performance improvement! I ran my nfs
> read test (48 clients read 200 MB file from one 4-way SMP NFS
> server) and compared your patches to regular 2.5.7. Regular 2.5.7
> resulted in 87 MB/sec with 100% CPU utilization. Your patch resulted
> in 130 MB/sec with 82% CPU utilization! This is very good! I took
> profiles, and as expected, csum_copy and file_read_actor were gone
> with the patch. Sar reported nearly 40 MB/sec per gigabit adapter
> (there are 4) during the test. That is the most I have seen so far.
> Soon I will be doing some lock analysis to make sure we don't have
> any locking problems. Also, I will see if there is anyone here at
> IBM LTC that can assist with your development of zerocopy on UDP.
> Thanks for the patch!
>
> Andrew Theurer

^ permalink raw reply	[flat|nested] 54+ messages in thread
* Re: [PATCH] zerocopy NFS updated
  2002-04-13  6:39 ` Andi Kleen
  2002-04-13  8:01   ` Hirokazu Takahashi
@ 2002-04-13 19:19   ` Eric W. Biederman
  2002-04-13 19:37     ` Andi Kleen
  1 sibling, 1 reply; 54+ messages in thread
From: Eric W. Biederman @ 2002-04-13 19:19 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Jamie Lokier, David S. Miller, taka, linux-kernel

Andi Kleen <ak@suse.de> writes:

> > I wonder if it is reasonable to depend on that: i.e., I'll only
> > ever see zeros, not say random bytes, or ones or something. I'm
> > sure that's so with the current kernel, and probably all of them
> > ever (except for bugs), but I wonder whether it's OK to rely on
> > that.
>
> With truncates you should only ever see zeros. If you want this
> guarantee over system crashes you need to make sure to use the right
> file system though (e.g. ext2, or reiserfs without the ordered data
> mode patches, or ext3 in writeback mode could give you junk if the
> system crashes at the wrong time). Still, depending on only seeing
> zeroes seems a bit fragile to me (what happens when the disk dies,
> for example?); using some other locking protocol is probably safer.

Could the garbage from ext3 in writeback mode be considered an
information leak? I know that is why most places in the kernel
initialize pages to 0, so you don't accidentally see what another
user put there.

Eric

^ permalink raw reply	[flat|nested] 54+ messages in thread
* Re: [PATCH] zerocopy NFS updated
  2002-04-13 19:19 ` Eric W. Biederman
@ 2002-04-13 19:37   ` Andi Kleen
  2002-04-13 20:34     ` Eric W. Biederman
  0 siblings, 1 reply; 54+ messages in thread
From: Andi Kleen @ 2002-04-13 19:37 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andi Kleen, Jamie Lokier, David S. Miller, taka, linux-kernel

On Sat, Apr 13, 2002 at 01:19:46PM -0600, Eric W. Biederman wrote:
> Could the garbage from ext3 in writeback mode be considered an
> information leak? I know that is why most places in the kernel
> initialize pages to 0, so you don't accidentally see what another
> user put there.

Yes, it could. But then ext2/ffs have the same problem, and so far
people were able to live with that.

-Andi

^ permalink raw reply	[flat|nested] 54+ messages in thread
* Re: [PATCH] zerocopy NFS updated
  2002-04-13 19:37 ` Andi Kleen
@ 2002-04-13 20:34   ` Eric W. Biederman
  2002-04-24 23:11     ` Mike Fedyk
  0 siblings, 1 reply; 54+ messages in thread
From: Eric W. Biederman @ 2002-04-13 20:34 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Jamie Lokier, David S. Miller, taka, linux-kernel

Andi Kleen <ak@suse.de> writes:

> On Sat, Apr 13, 2002 at 01:19:46PM -0600, Eric W. Biederman wrote:
> > Could the garbage from ext3 in writeback mode be considered an
> > information leak? I know that is why most places in the kernel
> > initialize pages to 0, so you don't accidentally see what another
> > user put there.
>
> Yes, it could. But then ext2/ffs have the same problem, and so far
> people were able to live with that.

The reason I asked is that the description sounded specific to ext3.
Also, with ext3 a supported way to shut down is to just pull the
power on the machine, and the filesystem comes back to life without a
full fsck.

So if this can happen when all you need is to replay the journal, I
have issues with it. If it happens only in the case of a damaged
filesystem, I don't.

Eric

^ permalink raw reply	[flat|nested] 54+ messages in thread
* Re: [PATCH] zerocopy NFS updated
  2002-04-13 20:34 ` Eric W. Biederman
@ 2002-04-24 23:11   ` Mike Fedyk
  2002-04-25 17:11     ` Andreas Dilger
  0 siblings, 1 reply; 54+ messages in thread
From: Mike Fedyk @ 2002-04-24 23:11 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andi Kleen, Jamie Lokier, David S. Miller, taka, linux-kernel

On Sat, Apr 13, 2002 at 02:34:12PM -0600, Eric W. Biederman wrote:
> The reason I asked is that the description sounded specific to ext3.
> Also, with ext3 a supported way to shut down is to just pull the
> power on the machine, and the filesystem comes back to life without
> a full fsck.
>
> So if this can happen when all you need is to replay the journal, I
> have issues with it. If it happens only in the case of a damaged
> filesystem, I don't.

Actually, with ext3, IIRC the only mode that will keep this from
happening is data=journal. In ordered or writeback mode there is a
window where the pages will be zeroed in memory, but not on disk.

Admittedly, the time window is largest in writeback mode, smaller in
ordered, and smallest (non-existent?) in data journaling mode.

Mike

^ permalink raw reply	[flat|nested] 54+ messages in thread
* Re: [PATCH] zerocopy NFS updated
  2002-04-24 23:11 ` Mike Fedyk
@ 2002-04-25 17:11   ` Andreas Dilger
  0 siblings, 0 replies; 54+ messages in thread
From: Andreas Dilger @ 2002-04-25 17:11 UTC (permalink / raw)
  To: Eric W. Biederman, Andi Kleen, Jamie Lokier, David S. Miller,
	taka, linux-kernel

On Apr 24, 2002 16:11 -0700, Mike Fedyk wrote:
> Actually, with ext3, IIRC the only mode that will keep this from
> happening is data=journal. In ordered or writeback mode there is a
> window where the pages will be zeroed in memory, but not on disk.
>
> Admittedly, the time window is largest in writeback mode, smaller in
> ordered, and smallest (non-existent?) in data journaling mode.

One thing you are forgetting is that with data=ordered mode, the
inode itself is not updated until the data has been written to the
disk. So technically you are correct -- with ordered mode there is a
window where pages are updated in memory but not on disk, but if you
crash during that window the inode size will be the old size, so you
will still not be able to access the un-zero'd data on disk.

It is only with data=writeback that this could be a problem, because
there is no ordering between updating the inode and writing the data
to disk. That's why there is only a real benefit to using
data=writeback for applications like databases and such where the
file size doesn't change and you are writing into the middle of the
file. In many cases, data=ordered is actually faster than
data=writeback.

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/

^ permalink raw reply	[flat|nested] 54+ messages in thread
* Re: [PATCH] zerocopy NFS updated
  2002-04-12 21:31 ` David S. Miller
  2002-04-13  0:21   ` Jamie Lokier
@ 2002-04-13 18:52   ` Chris Wedgwood
  2002-04-14  0:07     ` Keith Owens
  1 sibling, 1 reply; 54+ messages in thread
From: Chris Wedgwood @ 2002-04-13 18:52 UTC (permalink / raw)
  To: David S. Miller; +Cc: lk, ak, taka, linux-kernel

On Fri, Apr 12, 2002 at 02:31:50PM -0700, David S. Miller wrote:

    If you need to depend upon a consistent snapshot of what some
    other thread writes into a file, you must have some locking
    protocol to use to synchronize with that other thread.

Appends of small writes (for whatever reason) seem to be atomic;
AFAIK nobody gets corrupt Apache logs, for example.

--cw

^ permalink raw reply	[flat|nested] 54+ messages in thread
* Re: [PATCH] zerocopy NFS updated
  2002-04-13 18:52 ` Chris Wedgwood
@ 2002-04-14  0:07   ` Keith Owens
  2002-04-14  8:19     ` Chris Wedgwood
  0 siblings, 1 reply; 54+ messages in thread
From: Keith Owens @ 2002-04-14  0:07 UTC (permalink / raw)
  To: linux-kernel

On Sat, 13 Apr 2002 11:52:49 -0700, Chris Wedgwood <cw@f00f.org> wrote:
> On Fri, Apr 12, 2002 at 02:31:50PM -0700, David S. Miller wrote:
>
>     If you need to depend upon a consistent snapshot of what some
>     other thread writes into a file, you must have some locking
>     protocol to use to synchronize with that other thread.
>
> Appends of small writes (for whatever reason) seem to be atomic;
> AFAIK nobody gets corrupt Apache logs, for example.

Write in append mode must be atomic in the kernel. Whether a user
space write in append mode is atomic or not depends on how many
write() syscalls it takes to pass the data into the kernel. Each
write() append will be atomic but multiple writes can be interleaved.

^ permalink raw reply	[flat|nested] 54+ messages in thread
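Illustratively, a logger that keeps each record in one write() on an
O_APPEND descriptor relies on exactly this guarantee (a sketch; the
path is arbitrary):

    /* Sketch: because each record goes out in a single write() on an
     * O_APPEND descriptor, records from different processes never
     * interleave.  Splitting one record across two write() calls
     * would lose that guarantee. */
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    void log_line(const char *line)
    {
            int fd = open("/var/log/demo.log",
                          O_WRONLY | O_APPEND | O_CREAT, 0644);

            write(fd, line, strlen(line));  /* one record, one syscall */
            close(fd);
    }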
* Re: [PATCH] zerocopy NFS updated
  2002-04-14  0:07 ` Keith Owens
@ 2002-04-14  8:19   ` Chris Wedgwood
  2002-04-14  8:40     ` Keith Owens
  0 siblings, 1 reply; 54+ messages in thread
From: Chris Wedgwood @ 2002-04-14  8:19 UTC (permalink / raw)
  To: Keith Owens; +Cc: linux-kernel

On Sun, Apr 14, 2002 at 10:07:56AM +1000, Keith Owens wrote:

    Write in append mode must be atomic in the kernel. Whether a user
    space write in append mode is atomic or not depends on how many
    write() syscalls it takes to pass the data into the kernel. Each
    write() append will be atomic but multiple writes can be
    interleaved.

Up to what size? I assume I cannot assume O_APPEND atomicity for
(say) 100M writes?

--cw

^ permalink raw reply	[flat|nested] 54+ messages in thread
* Re: [PATCH] zerocopy NFS updated
  2002-04-14  8:19 ` Chris Wedgwood
@ 2002-04-14  8:40   ` Keith Owens
  0 siblings, 0 replies; 54+ messages in thread
From: Keith Owens @ 2002-04-14  8:40 UTC (permalink / raw)
  To: Chris Wedgwood; +Cc: linux-kernel

On Sun, 14 Apr 2002 01:19:46 -0700, Chris Wedgwood <cw@f00f.org> wrote:
> On Sun, Apr 14, 2002 at 10:07:56AM +1000, Keith Owens wrote:
>
>     Write in append mode must be atomic in the kernel. Whether a
>     user space write in append mode is atomic or not depends on how
>     many write() syscalls it takes to pass the data into the kernel.
>     Each write() append will be atomic but multiple writes can be
>     interleaved.
>
> Up to what size? I assume I cannot assume O_APPEND atomicity for
> (say) 100M writes?

Atomic on that inode, not atomic w.r.t. other I/O to other inodes.
Most write operations use generic_file_write(), which grabs the inode
semaphore. No other writes (or indeed any other I/O) can proceed on
the inode until this write completes and releases the semaphore.

I suppose that some filesystem could use its own write method that
releases the lock during the write operation. I would not trust my
data to such filesystems; they violate SUSv2:

"If the O_APPEND flag of the file status flags is set, the file
offset shall be set to the end of the file prior to each write and no
intervening file modification operation shall occur between changing
the file offset and the write operation"

^ permalink raw reply	[flat|nested] 54+ messages in thread
* Re: [PATCH] zerocopy NFS updated
  2002-04-12 12:35 ` Andi Kleen
  2002-04-12 21:22   ` Jamie Lokier
@ 2002-04-12 21:39   ` David S. Miller
  2002-04-15  1:30     ` Hirokazu Takahashi
  1 sibling, 1 reply; 54+ messages in thread
From: David S. Miller @ 2002-04-12 21:39 UTC (permalink / raw)
  To: ak; +Cc: taka, linux-kernel

   From: Andi Kleen <ak@suse.de>
   Date: Fri, 12 Apr 2002 14:35:59 +0200

   On Fri, Apr 12, 2002 at 09:30:11PM +0900, Hirokazu Takahashi wrote:
   > Analyzing the situation: the read system call doesn't lock
   > anything -- no page lock, no semaphore -- while someone truncates
   > a file partially. It happens most often when copy_user() takes a
   > page fault while copying file data to user space.

   I don't see it as a big problem and would just leave it as it is
   (for NFS and local).

I agree with Andi. You can basically throw away my whole argument
about this. Applications that require synchronization between the
writer of file contents and the reader of file contents must do some
kind of locking amongst themselves at user level.

^ permalink raw reply	[flat|nested] 54+ messages in thread
* Re: [PATCH] zerocopy NFS updated
  2002-04-12 21:39 ` David S. Miller
@ 2002-04-15  1:30   ` Hirokazu Takahashi
  2002-04-15  4:23     ` David S. Miller
  0 siblings, 1 reply; 54+ messages in thread
From: Hirokazu Takahashi @ 2002-04-15  1:30 UTC (permalink / raw)
  To: davem; +Cc: ak, linux-kernel

Hello, David

If you don't mind, could you give me some advice about the sendpage
mechanism?

I'd like to implement sendpage for the UDP stack, which NFS uses
heavily. It may improve the performance of NFS over UDP dramatically.

I wonder if, with a "SENDPAGES" interface instead of sendpage between
the socket layer and the inet layer, we could send several pages
atomically with low overhead. It would also make it easy for RPC over
UDP to send multiple pages as one UDP packet.

What do you think about this approach?

davem> I don't see it as a big problem and would just leave it as it
davem> is (for NFS and local).
davem>
davem> I agree with Andi. You can basically throw away my whole
davem> argument about this. Applications that require synchronization
davem> between the writer of file contents and the reader of file
davem> contents must do some kind of locking amongst themselves at
davem> user level.

OK.

Regards,
Hirokazu Takahashi

^ permalink raw reply	[flat|nested] 54+ messages in thread
* Re: [PATCH] zerocopy NFS updated
  2002-04-15  1:30 ` Hirokazu Takahashi
@ 2002-04-15  4:23   ` David S. Miller
  2002-04-16  1:03     ` Hirokazu Takahashi
  0 siblings, 1 reply; 54+ messages in thread
From: David S. Miller @ 2002-04-15  4:23 UTC (permalink / raw)
  To: taka; +Cc: ak, linux-kernel

   From: Hirokazu Takahashi <taka@valinux.co.jp>
   Date: Mon, 15 Apr 2002 10:30:13 +0900 (JST)

   I'd like to implement sendpage for the UDP stack, which NFS uses
   heavily. It may improve the performance of NFS over UDP
   dramatically.

   I wonder if, with a "SENDPAGES" interface instead of sendpage
   between the socket layer and the inet layer, we could send several
   pages atomically with low overhead. It would also make it easy for
   RPC over UDP to send multiple pages as one UDP packet.

   What do you think about this approach?

Sendpages mechanism will not be implemented.

You must implement UDP sendfile() one page at a time, by building up
an SKB with multiple calls, similar to TCP with the TCP_CORK socket
option set.

For datagram sockets, define a temporary SKB hung off of struct sock.
Define a UDP_CORK socket option which begins the "queue data only"
state.

All sendmsg()/sendfile() calls append to the temporary SKB; the first
sendmsg()/sendfile() call to UDP will create this sock->skb. The
first call may be sendmsg(), but subsequent calls for that SKB must
be sendfile() calls. If this pattern of calls is broken, the SKB is
sent.

A call setting the UDP_CORK socket option to zero actually sends the
SKB being built.

The normal usage will be:

	setsockopt(fd, UDP_CORK, 1);
	sendmsg(fd, sunrpc_headers, sizeof(sunrpc_headers));
	sendfile(fd, ...);
	setsockopt(fd, UDP_CORK, 0);

^ permalink raw reply	[flat|nested] 54+ messages in thread
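Spelled out with full syscall signatures, the sequence might look
like the sketch below. UDP_CORK is only proposed above and does not
exist yet; its value and its level (IPPROTO_UDP, by analogy with
TCP_CORK at IPPROTO_TCP) are assumptions:

    #include <netinet/in.h>
    #include <sys/sendfile.h>
    #include <sys/socket.h>

    #ifndef UDP_CORK
    #define UDP_CORK 1      /* assumed value; the option is only proposed */
    #endif

    void send_reply(int fd, const void *hdr, size_t hdrlen,
                    int filefd, off_t off, size_t count)
    {
            int one = 1, zero = 0;

            setsockopt(fd, IPPROTO_UDP, UDP_CORK, &one, sizeof(one));
            send(fd, hdr, hdrlen, 0);           /* RPC + NFS headers */
            sendfile(fd, filefd, &off, count);  /* file pages, zerocopy */
            setsockopt(fd, IPPROTO_UDP, UDP_CORK, &zero, sizeof(zero));
                                                /* uncork: one UDP packet */
    }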
* Re: [PATCH] zerocopy NFS updated
  2002-04-15  4:23 ` David S. Miller
@ 2002-04-16  1:03   ` Hirokazu Takahashi
  2002-04-16  1:41     ` Jakob Østergaard
  0 siblings, 1 reply; 54+ messages in thread
From: Hirokazu Takahashi @ 2002-04-16  1:03 UTC (permalink / raw)
  To: davem; +Cc: ak, linux-kernel

Hi, David

Thank you for your advice!

davem> Sendpages mechanism will not be implemented.
davem>
davem> You must implement UDP sendfile() one page at a time, by
davem> building up an SKB with multiple calls, similar to TCP with the
davem> TCP_CORK socket option set.
davem>
davem> For datagram sockets, define a temporary SKB hung off of struct
davem> sock. Define a UDP_CORK socket option which begins the "queue
davem> data only" state.
davem>
davem> The normal usage will be:
davem>
davem> 	setsockopt(fd, UDP_CORK, 1);
davem> 	sendmsg(fd, sunrpc_headers, sizeof(sunrpc_headers));
davem> 	sendfile(fd, ...);
davem> 	setsockopt(fd, UDP_CORK, 0);

Yes, it seems to be the most general way. OK, I'll do it this way
first of all.

In the kernel, I'd probably implement it as follows:

	put an RPC header and an NFS header in "bufferA";
	down(semaphore);
	sendmsg(bufferA, MSG_MORE);
	for (each page of fileC)
		sock->ops->sendpage(page, islastpage ? 0 : MSG_MORE);
	up(semaphore);

The semaphore is required to serialize sending data, as many knfsd
kthreads use the same socket.

Actually I'd like to implement it like the following code, but
unfortunately it wouldn't work on the UDP socket of a server, as that
socket has no specific destination address at all and sendpage has no
argument to specify one. It's not so good...

	put an RPC header and an NFS header on "pageB";
	down(semaphore);
	sock->ops->sendpage(pageB, MSG_MORE);
	for (each page of fileC)
		sock->ops->sendpage(page, islastpage ? 0 : MSG_MORE);
	up(semaphore);

Thank you,
Hirokazu Takahashi

^ permalink raw reply	[flat|nested] 54+ messages in thread
* Re: [PATCH] zerocopy NFS updated
  2002-04-16  1:03 ` Hirokazu Takahashi
@ 2002-04-16  1:41   ` Jakob Østergaard
  2002-04-16  2:20     ` Hirokazu Takahashi
  2002-04-18  5:01     ` Hirokazu Takahashi
  0 siblings, 2 replies; 54+ messages in thread
From: Jakob Østergaard @ 2002-04-16  1:41 UTC (permalink / raw)
  To: Hirokazu Takahashi; +Cc: davem, ak, linux-kernel

On Tue, Apr 16, 2002 at 10:03:02AM +0900, Hirokazu Takahashi wrote:
> Hi, David
> ...
>
> Yes, it seems to be the most general way. OK, I'll do it this way
> first of all.
>
> In the kernel, I'd probably implement it as follows:
>
>	put an RPC header and an NFS header in "bufferA";
>	down(semaphore);
>	sendmsg(bufferA, MSG_MORE);
>	for (each page of fileC)
>		sock->ops->sendpage(page, islastpage ? 0 : MSG_MORE);
>	up(semaphore);
>
> The semaphore is required to serialize sending data, as many knfsd
> kthreads use the same socket.

Won't this serialize too much? I mean, consider the situation where
we have file-A and file-B completely in cache, while file-C needs to
be read from the physical disk.

Three different clients (A, B and C) request file-A, file-B and
file-C respectively. The send of file-C is started first, and the
sends of files A and B (which could commence immediately and complete
at near wire-speed) will now have to wait (leaving the NIC idle)
until file-C is read from the disks.

Even if it's not the entire file but only a single NFS request
(probably 8 kB), one disk seek (7 ms) is still around 85 kB of wire
time, or 10 8 kB NFS requests (at 100 Mbit).

Or am I misunderstanding? Will your UDP sendpage() queue the
requests?

--
................................................................
:   jakob@unthought.net  : And I see the elder races,          :
:.........................: putrid forms of man                :
:   Jakob Østergaard     : See him rise and claim the earth,   :
:        OZ9ABN          : his downfall is at hand.            :
:.........................:............{Konkhra}...............:

^ permalink raw reply	[flat|nested] 54+ messages in thread
* Re: [PATCH] zerocopy NFS updated
  2002-04-16  1:41 ` Jakob Østergaard
@ 2002-04-16  2:20   ` Hirokazu Takahashi
  0 siblings, 0 replies; 54+ messages in thread
From: Hirokazu Takahashi @ 2002-04-16  2:20 UTC (permalink / raw)
  To: jakob; +Cc: davem, ak, linux-kernel

Hi,

jakob> Won't this serialize too much? I mean, consider the situation
jakob> where we have file-A and file-B completely in cache, while
jakob> file-C needs to be read from the physical disk.
jakob>
jakob> Three different clients (A, B and C) request file-A, file-B and
jakob> file-C respectively. The send of file-C is started first, and
jakob> the sends of files A and B (which could commence immediately
jakob> and complete at near wire-speed) will now have to wait (leaving
jakob> the NIC idle) until file-C is read from the disks.
jakob>
jakob> Even if it's not the entire file but only a single NFS request
jakob> (probably 8 kB), one disk seek (7 ms) is still around 85 kB of
jakob> wire time, or 10 8 kB NFS requests (at 100 Mbit).
jakob>
jakob> Or am I misunderstanding? Will your UDP sendpage() queue the
jakob> requests?

No problem. In my implementation, a knfsd first grabs all the pages
-- the part of file-C needed to reply to the NFS client -- and only
after that starts to send them. It won't block any other knfsds
during disk I/O.

Thank you,
Hirokazu Takahashi.

^ permalink raw reply	[flat|nested] 54+ messages in thread
* Re: [PATCH] zerocopy NFS updated
  2002-04-16  1:41 ` Jakob Østergaard
  2002-04-16  2:20   ` Hirokazu Takahashi
@ 2002-04-18  5:01   ` Hirokazu Takahashi
  2002-04-18  7:58     ` Jakob Østergaard
  2002-04-18  8:53     ` Trond Myklebust
  1 sibling, 2 replies; 54+ messages in thread
From: Hirokazu Takahashi @ 2002-04-18  5:01 UTC (permalink / raw)
  To: jakob; +Cc: davem, ak, linux-kernel

Hi,

I've been thinking about your comment, and I realized it was a good
suggestion. There is no problem with the zerocopy NFS, but if you
want to use UDP sendfile for streaming or something like that, you
wouldn't get good performance.

jakob> > The semaphore is required to serialize sending data, as many
jakob> > knfsd kthreads use the same socket.
jakob>
jakob> Won't this serialize too much? I mean, consider the situation
jakob> where we have file-A and file-B completely in cache, while
jakob> file-C needs to be read from the physical disk.
jakob>
jakob> Three different clients (A, B and C) request file-A, file-B and
jakob> file-C respectively. The send of file-C is started first, and
jakob> the sends of files A and B (which could commence immediately
jakob> and complete at near wire-speed) will now have to wait (leaving
jakob> the NIC idle) until file-C is read from the disks.
jakob>
jakob> Even if it's not the entire file but only a single NFS request
jakob> (probably 8 kB), one disk seek (7 ms) is still around 85 kB of
jakob> wire time, or 10 8 kB NFS requests (at 100 Mbit).
jakob>
jakob> Or am I misunderstanding? Will your UDP sendpage() queue the
jakob> requests?

There may be many threads on a streaming server, and they would share
the same UDP socket. The UDP_CORK mechanism requires a semaphore to
serialize sending data, and the threads would block each other for a
long time, because sendfile() might make them sleep in block I/O as
you said.

You may say we can use MSG_MORE instead of UDP_CORK. But then we
would have to make a queue per client destination, and we can't link
pages to the right queue, as sendfile() has no argument specifying
the destination.

     client                            server
   UDP sockets                       UDP socket

   +---------+
   |dest:123 |---------+
   |         |         |
   +---------+         |
                       V
   +---------+      +---------+  <--- thread1
   |dest:123 |----->|src:123  |  <--- thread2
   |         |      |dest:ANY |  <--- thread3
   +---------+      +---------+  <--- thread4
                       A
   +---------+         |
   |dest:123 |---------+
   |         |
   +---------+

Shall I make multiple queues based on pid instead of destination
address? Any idea is welcome!

Thank you,
Hirokazu Takahashi

^ permalink raw reply	[flat|nested] 54+ messages in thread
* Re: [PATCH] zerocopy NFS updated
  2002-04-18  5:01 ` Hirokazu Takahashi
@ 2002-04-18  7:58   ` Jakob Østergaard
  0 siblings, 0 replies; 54+ messages in thread
From: Jakob Østergaard @ 2002-04-18  7:58 UTC (permalink / raw)
  To: Hirokazu Takahashi; +Cc: davem, ak, linux-kernel

On Thu, Apr 18, 2002 at 02:01:55PM +0900, Hirokazu Takahashi wrote:
> Hi,
>
> I've been thinking about your comment, and I realized it was a good
> suggestion. There is no problem with the zerocopy NFS, but if you
> want to use UDP sendfile for streaming or something like that, you
> wouldn't get good performance.

Hi again,

So the problem is that it is too easy to use UDP sendfile "poorly",
right? Your NFS threads don't have the problem because you make sure
that pages are in core prior to the sendfile call, but not every
developer may think that far...

...

> Shall I make multiple queues based on pid instead of destination
> address? Any idea is welcome!

Ok, so here's some ideas. I'm no expert, so if something below seems
subtle, it's more likely to be plain stupid rather than something
really clever ;)

In order to keep sendfile as simple as possible, perhaps one could
just make it fail if not all pages are in core. So, your NFS send
routine would be something like:

   retry:
	submit_read_requests
	await_io_completion
	rc = sendfile(..)
	if (rc == -EFAULT)
		goto retry

(I suppose even the retry is optional -- this being UDP, the packet
could be dropped anywhere anyway. The rationale behind retrying
immediately is that "almost" all pages are probably in core.)

That would keep sendfile simple, and force its users to think of a
clever way to make sure the pages are ready (and about what to do if
they aren't).

This is obviously not something one can do from userspace. There, I
think that your suggestion with a queue per pid seems like a nice
solution. What I worry about is, if the machine is under heavy memory
pressure and the queue entries start piling up -- if sendfile is not
clever (or somehow lets the VM figure out what to do with the
requests), the queues will be competing against each other, taking
even longer to finish...

Perhaps userspace would call some sendfile wrapper routine that would
do the queue management, while kernel threads that are sufficiently
clever by themselves will just call the lightweight sendfile.

Or, will the queue management be simpler and less dangerous than I
think? :)

Cheers,
--
................................................................
:   jakob@unthought.net  : And I see the elder races,          :
:.........................: putrid forms of man                :
:   Jakob Østergaard     : See him rise and claim the earth,   :
:        OZ9ABN          : his downfall is at hand.            :
:.........................:............{Konkhra}...............:

^ permalink raw reply	[flat|nested] 54+ messages in thread
* Re: [PATCH] zerocopy NFS updated 2002-04-18 5:01 ` Hirokazu Takahashi 2002-04-18 7:58 ` Jakob Østergaard @ 2002-04-18 8:53 ` Trond Myklebust 2002-04-19 3:21 ` Hirokazu Takahashi 1 sibling, 1 reply; 54+ messages in thread
From: Trond Myklebust @ 2002-04-18 8:53 UTC (permalink / raw)
To: Hirokazu Takahashi; +Cc: jakob, davem, ak, linux-kernel

>>>>> " " == Hirokazu Takahashi <taka@valinux.co.jp> writes:

     > Hi, I've been thinking about your comment, and I realized it
     > was a good suggestion. There is no problem with the zerocopy
     > NFS itself, but if you want to use UDP sendfile for streaming
     > or something like that, you won't get good performance.

Surely one can work around this in userland without inventing a load
of ad-hoc schemes in the kernel socket layer?

If one doesn't want to create a pool of sockets in order to service
the different threads, one can use generic methods such as
sys_readahead() in order to ensure that the relevant data gets paged
in prior to hogging the socket.

There is no difference between UDP and TCP sendfile() in this respect.

Cheers,
  Trond

^ permalink raw reply	[flat|nested] 54+ messages in thread
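To make the suggestion concrete, here is a rough userspace sketch of the
same idea using the readahead(2) system call (the userland entry point
for sys_readahead): page the data in before taking the shared socket, so
sendfile() is unlikely to sleep on disk while holding it. The function
and lock names are illustrative; note that readahead() only populates
the page cache and nothing pins the pages, so this narrows the race
window rather than closing it:

#define _GNU_SOURCE
#include <fcntl.h>		/* readahead() */
#include <pthread.h>
#include <sys/sendfile.h>
#include <sys/types.h>

static pthread_mutex_t udp_sock_lock = PTHREAD_MUTEX_INITIALIZER;

ssize_t send_chunk(int sock, int fd, off_t off, size_t count)
{
	ssize_t rc;

	/* wait for the disk here, without hogging the shared socket */
	readahead(fd, off, count);

	pthread_mutex_lock(&udp_sock_lock);
	rc = sendfile(sock, fd, &off, count);	/* data likely in core now */
	pthread_mutex_unlock(&udp_sock_lock);
	return rc;
}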
* Re: [PATCH] zerocopy NFS updated 2002-04-18 8:53 ` Trond Myklebust @ 2002-04-19 3:21 ` Hirokazu Takahashi 2002-04-19 9:18 ` Trond Myklebust 0 siblings, 1 reply; 54+ messages in thread
From: Hirokazu Takahashi @ 2002-04-19 3:21 UTC (permalink / raw)
To: trond.myklebust; +Cc: jakob, davem, ak, linux-kernel

Hi,

> > Hi, I've been thinking about your comment, and I realized it
> > was a good suggestion. There is no problem with the zerocopy
> > NFS itself, but if you want to use UDP sendfile for streaming
> > or something like that, you won't get good performance.
>
> Surely one can work around this in userland without inventing a load
> of ad-hoc schemes in the kernel socket layer?
>
> If one doesn't want to create a pool of sockets in order to service
> the different threads, one can use generic methods such as
> sys_readahead() in order to ensure that the relevant data gets paged
> in prior to hogging the socket.

That makes sense. It would work well enough in many cases, though it
would be hard to make sure that the data really is still in core by
the time sendfile() runs.

> There is no difference between UDP and TCP sendfile() in this respect.

Yes. But it seems to matter more for UDP sendfile(): processes or
threads sharing the same UDP socket affect each other, while processes
or threads on TCP sockets don't have to care, as each TCP connection
is peer-to-peer.

Thank you,
Hirokazu Takahashi.

^ permalink raw reply	[flat|nested] 54+ messages in thread
* Re: [PATCH] zerocopy NFS updated 2002-04-19 3:21 ` Hirokazu Takahashi @ 2002-04-19 9:18 ` Trond Myklebust 2002-04-20 7:47 ` Hirokazu Takahashi [not found] ` <200204192128.QAA24592@popmail.austin.ibm.com> 0 siblings, 2 replies; 54+ messages in thread
From: Trond Myklebust @ 2002-04-19 9:18 UTC (permalink / raw)
To: Hirokazu Takahashi; +Cc: linux-kernel

On Friday 19. April 2002 05:21, Hirokazu Takahashi wrote:
> And it seems to matter more for UDP sendfile(): processes or threads
> sharing the same UDP socket affect each other, while processes or
> threads on TCP sockets don't have to care, as each TCP connection is
> peer-to-peer.

No. It is not the lack of peer-to-peer connections that gives rise to
the bottleneck, but the idea of several threads multiplexing
sendfile() through a single socket. Given a bad program design, it can
be done over TCP too.

The conclusion is that the programmer really ought to choose a
different design. For multimedia streaming, for instance, it makes
sense to use 1 UDP socket per thread rather than to multiplex the
output through one socket.

Cheers,
  Trond

^ permalink raw reply	[flat|nested] 54+ messages in thread
* Re: [PATCH] zerocopy NFS updated 2002-04-19 9:18 ` Trond Myklebust @ 2002-04-20 7:47 ` Hirokazu Takahashi 2002-04-25 12:37 ` Possible bug with UDP and SO_REUSEADDR. Was " Terje Eggestad 1 sibling, 1 reply; 54+ messages in thread
From: Hirokazu Takahashi @ 2002-04-20 7:47 UTC (permalink / raw)
To: trond.myklebust; +Cc: linux-kernel

Hi,

> > And it seems to matter more for UDP sendfile(): processes or threads
> > sharing the same UDP socket affect each other, while processes or
> > threads on TCP sockets don't have to care, as each TCP connection is
> > peer-to-peer.
>
> No. It is not the lack of peer-to-peer connections that gives rise to
> the bottleneck, but the idea of several threads multiplexing
> sendfile() through a single socket. Given a bad program design, it can
> be done over TCP too.
>
> The conclusion is that the programmer really ought to choose a
> different design. For multimedia streaming, for instance, it makes
> sense to use 1 UDP socket per thread rather than to multiplex the
> output through one socket.

You mean, create multiple UDP sockets which share the same port
number? Yes, we can do that if we use setsockopt(SO_REUSEADDR), and it
could lead to less contention between CPUs. Sounds good!

Thank you,
Hirokazu Takahashi.

^ permalink raw reply	[flat|nested] 54+ messages in thread
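A sketch of what that per-thread setup could look like, with
illustrative names, and with the caveat (explored in the rest of this
thread) that which of the duplicate-bound sockets receives a given
unicast datagram is implementation specific:

#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Each worker thread calls this once and then sends on its own
 * descriptor, so no semaphore is needed to serialize transmission. */
int make_worker_socket(unsigned short port)
{
	struct sockaddr_in sin;
	int s, one = 1;

	s = socket(PF_INET, SOCK_DGRAM, 0);
	if (s < 0)
		return -1;

	/* must be set before bind(), on every socket sharing the port */
	if (setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one)) < 0) {
		close(s);
		return -1;
	}

	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_port = htons(port);
	sin.sin_addr.s_addr = htonl(INADDR_ANY);
	if (bind(s, (struct sockaddr *)&sin, sizeof(sin)) < 0) {
		close(s);
		return -1;
	}

	return s;
}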
* Possible bug with UDP and SO_REUSEADDR. Was Re: [PATCH] zerocopy NFS updated 2002-04-20 7:47 ` Hirokazu Takahashi @ 2002-04-25 12:37 ` Terje Eggestad 2002-04-26 2:43 ` David S. Miller 0 siblings, 1 reply; 54+ messages in thread
From: Terje Eggestad @ 2002-04-25 12:37 UTC (permalink / raw)
To: linux-kernel

Seeing this mail from Hirokazu triggered my curiosity; I'd never
contemplated using address reuse on a UDP socket.

However, after writing a test server that sits in a blocking wait on a
UDP socket and starting two instances of it, it's ALWAYS the server
started last that gets the UDP message, even if it's not in a blocking
wait while the server started first is.

Smells like a bug to me; this behavior doesn't make much sense.

Using stock 2.4.17.

TJ

On Sat, 2002-04-20 at 09:47, Hirokazu Takahashi wrote:
> Hi,
>
> > > And it seems to matter more for UDP sendfile(): processes or
> > > threads sharing the same UDP socket affect each other, while
> > > processes or threads on TCP sockets don't have to care, as each
> > > TCP connection is peer-to-peer.
> >
> > No. It is not the lack of peer-to-peer connections that gives rise
> > to the bottleneck, but the idea of several threads multiplexing
> > sendfile() through a single socket. Given a bad program design, it
> > can be done over TCP too.
> >
> > The conclusion is that the programmer really ought to choose a
> > different design. For multimedia streaming, for instance, it makes
> > sense to use 1 UDP socket per thread rather than to multiplex the
> > output through one socket.
>
> You mean, create multiple UDP sockets which share the same port
> number? Yes, we can do that if we use setsockopt(SO_REUSEADDR), and
> it could lead to less contention between CPUs. Sounds good!
>
> Thank you,
> Hirokazu Takahashi.

--
_________________________________________________________________________
Terje Eggestad                 mailto:terje.eggestad@scali.no
Scali Scalable Linux Systems   http://www.scali.com
Olaf Helsets Vei 6             tel: +47 22 62 89 61 (OFFICE)
P.O.Box 150, Oppsal                 +47 975 31 574  (MOBILE)
N-0619 Oslo                    fax: +47 22 62 89 51
NORWAY
_________________________________________________________________________

^ permalink raw reply	[flat|nested] 54+ messages in thread
* Re: Possible bug with UDP and SO_REUSEADDR. Was Re: [PATCH] zerocopy NFS updated 2002-04-25 12:37 ` Possible bug with UDP and SO_REUSEADDR. Was " Terje Eggestad @ 2002-04-26 2:43 ` David S. Miller 2002-04-26 7:38 ` Terje Eggestad 2002-04-29 0:41 ` Possible bug with UDP and SO_REUSEADDR David Schwartz 0 siblings, 2 replies; 54+ messages in thread
From: David S. Miller @ 2002-04-26 2:43 UTC (permalink / raw)
To: terje.eggestad; +Cc: linux-kernel

From: Terje Eggestad <terje.eggestad@scali.com>
Date: 25 Apr 2002 14:37:44 +0200

   However, after writing a test server that sits in a blocking wait
   on a UDP socket and starting two instances of it, it's ALWAYS the
   server started last that gets the UDP message, even if it's not in
   a blocking wait while the server started first is.

   Smells like a bug to me; this behavior doesn't make much sense.

   Using stock 2.4.17.

Can you post your test server/client application so that I don't have
to write it myself and guess how you did things?

Thanks.

^ permalink raw reply	[flat|nested] 54+ messages in thread
* Re: Possible bug with UDP and SO_REUSEADDR. Was Re: [PATCH] zerocopy NFS updated 2002-04-26 2:43 ` David S. Miller @ 2002-04-26 7:38 ` Terje Eggestad 0 siblings, 0 replies; 54+ messages in thread
From: Terje Eggestad @ 2002-04-26 7:38 UTC (permalink / raw)
To: David S. Miller; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1394 bytes --]

'course

On Fri, 2002-04-26 at 04:43, David S. Miller wrote:
> From: Terje Eggestad <terje.eggestad@scali.com>
> Date: 25 Apr 2002 14:37:44 +0200
>
>    However, after writing a test server that sits in a blocking wait
>    on a UDP socket and starting two instances of it, it's ALWAYS the
>    server started last that gets the UDP message, even if it's not in
>    a blocking wait while the server started first is.
>
>    Smells like a bug to me; this behavior doesn't make much sense.
>
>    Using stock 2.4.17.
>
> Can you post your test server/client application so that I don't have
> to write it myself and guess how you did things?
>
> Thanks.

--
_________________________________________________________________________
Terje Eggestad                 mailto:terje.eggestad@scali.no
Scali Scalable Linux Systems   http://www.scali.com
Olaf Helsets Vei 6             tel: +47 22 62 89 61 (OFFICE)
P.O.Box 150, Oppsal                 +47 975 31 574  (MOBILE)
N-0619 Oslo                    fax: +47 22 62 89 51
NORWAY
_________________________________________________________________________

[-- Attachment #2: client.c --]
[-- Type: text/x-c, Size: 1336 bytes --]

#include <stdio.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>		/* memcpy (was missing) */
#include <time.h>		/* time, ctime (was missing) */
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>		/* inet_ntop (was missing) */
#include <netdb.h>
#include <unistd.h>
#include <fcntl.h>

#define BS 4096*3

int main(int argc, char **argv)
{
	int len, rc, s;
	short port = 6767;
	struct sockaddr_in mysock;
	char buffer[BS];
	char addr[16], *host;
	time_t t;
	struct hostent *he;

	s = socket(PF_INET, SOCK_DGRAM, 0);

	if (argc >= 2) {
		host = argv[1];
	} else {
		printf("error\n");
		exit(1);	/* host would be uninitialized below */
	}
	if (argc >= 3)
		port = atoi(argv[2]);

	mysock.sin_family = AF_INET;
	mysock.sin_port = htons(port);

	he = gethostbyname(host);
	if (he == NULL)
		exit(9);
	printf("host %s : %s, %d %p\n", host, he->h_name,
	       he->h_addrtype, (void *)he->h_addr_list[0]);
	memcpy(&mysock.sin_addr.s_addr, he->h_addr_list[0], he->h_length);

	printf("\n");
	time(&t);
	rc = sprintf(buffer, "%d : %s : hei hei", getpid(), ctime(&t));
	len = sizeof(struct sockaddr_in);
	inet_ntop(AF_INET, &mysock.sin_addr.s_addr, addr, 16);
	printf("sendto: %s:%d =\"%s\"\n", addr, ntohs(mysock.sin_port), buffer);
	rc = sendto(s, buffer, rc, 0, (struct sockaddr *)&mysock, len);
	return 0;
}

/*
 * Local variables:
 * compile-command: "gcc -g -o client client.c"
 * End:
 */

[-- Attachment #3: server.c --]
[-- Type: text/x-c, Size: 1351 bytes --]

#include <stdio.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>		/* inet_ntop (was missing) */
#include <unistd.h>
#include <fcntl.h>

#define BS 4096*3

int main(int argc, char **argv)
{
	int s, rc, val;
	short port = 6767;
	struct sockaddr_in mysock, peeraddr;
	socklen_t len;		/* recvfrom() wants socklen_t, not int */
	char buffer[BS];
	char addr[16];

	s = socket(PF_INET, SOCK_DGRAM, 0);

	if (argc >= 2)
		port = atoi(argv[1]);

	mysock.sin_family = AF_INET;
	mysock.sin_port = htons(port);
	mysock.sin_addr.s_addr = INADDR_ANY;

	val = 1;
	rc = setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &val, sizeof(int));
	printf("setsockopt => %d (%d)\n", rc, errno);
	rc = bind(s, (struct sockaddr *)&mysock, sizeof(struct sockaddr_in));
	printf("bind => %d (%d)\n", rc, errno);

	while (1) {
		printf("\n");
		len = sizeof(struct sockaddr_in);
		rc = recvfrom(s, buffer, BS - 1, 0,
			      (struct sockaddr *)&peeraddr, &len);
		if (rc == -1)
			printf("recvfrom returned %d (%d)\n", rc, errno);
		else {
			buffer[rc] = '\0';	/* terminate before %s */
			inet_ntop(AF_INET, &peeraddr.sin_addr.s_addr, addr, 16);
			printf("recvfrom got from %s:%d =\"%s\"\n",
			       addr, ntohs(peeraddr.sin_port), buffer);
		}
		sleep(1);
	}
}

/*
 * Local variables:
 * compile-command: "gcc -g -o server server.c"
 * End:
 */

^ permalink raw reply	[flat|nested] 54+ messages in thread
* Re: Possible bug with UDP and SO_REUSEADDR. 2002-04-26 2:43 ` David S. Miller 2002-04-26 7:38 ` Terje Eggestad @ 2002-04-29 0:41 ` David Schwartz 2002-04-29 8:06 ` Terje Eggestad 1 sibling, 1 reply; 54+ messages in thread
From: David Schwartz @ 2002-04-29 0:41 UTC (permalink / raw)
To: linux-kernel

On Thu, 25 Apr 2002 19:43:01 -0700 (PDT), David S. Miller wrote:
> From: Terje Eggestad <terje.eggestad@scali.com>
> Date: 25 Apr 2002 14:37:44 +0200
>
> However, after writing a test server that sits in a blocking wait on
> a UDP socket and starting two instances of it, it's ALWAYS the server
> started last that gets the UDP message, even if it's not in a
> blocking wait while the server started first is.
>
> Smells like a bug to me; this behavior doesn't make much sense.
>
> Using stock 2.4.17.
>
> Can you post your test server/client application so that I don't have
> to write it myself and guess how you did things?

There are really two possibilities:

1) The two instances are cooperating closely together and should be
sharing a socket (not each opening one), or

2) The two instances are not cooperating closely together and each own
their own socket. For all the kernel knows, they don't even know about
each other.

In the first case, it's logical for whichever one happens to try to
read first to get the/a datagram. In the second case, it's logical for
the kernel to pick one and give it all the data.

DS

^ permalink raw reply	[flat|nested] 54+ messages in thread
* Re: Possible bug with UDP and SO_REUSEADDR. 2002-04-29 0:41 ` Possible bug with UDP and SO_REUSEADDR David Schwartz @ 2002-04-29 8:06 ` Terje Eggestad 2002-04-29 8:44 ` David Schwartz 0 siblings, 1 reply; 54+ messages in thread
From: Terje Eggestad @ 2002-04-29 8:06 UTC (permalink / raw)
To: David Schwartz; +Cc: linux-kernel

On Mon, 2002-04-29 at 02:41, David Schwartz wrote:
> There are really two possibilities:
>
> 1) The two instances are cooperating closely together and should be
> sharing a socket (not each opening one), or
>
> 2) The two instances are not cooperating closely together and each
> own their own socket. For all the kernel knows, they don't even know
> about each other.
>
> In the first case, it's logical for whichever one happens to try to
> read first to get the/a datagram. In the second case, it's logical
> for the kernel to pick one and give it all the data.
>
> DS

IMHO, in the second case it's logical for the kernel NOT to allow the
second process to bind to the port at all. Which is in fact what it
does; that's the normal case. When you set the SO_REUSEADDR flag on
the socket, you're telling the kernel that we're in case 1).

TJ

--
_________________________________________________________________________
Terje Eggestad                 mailto:terje.eggestad@scali.no
Scali Scalable Linux Systems   http://www.scali.com
Olaf Helsets Vei 6             tel: +47 22 62 89 61 (OFFICE)
P.O.Box 150, Oppsal                 +47 975 31 574  (MOBILE)
N-0619 Oslo                    fax: +47 22 62 89 51
NORWAY
_________________________________________________________________________

^ permalink raw reply	[flat|nested] 54+ messages in thread
* Re: Possible bug with UDP and SO_REUSEADDR. 2002-04-29 8:06 ` Terje Eggestad @ 2002-04-29 8:44 ` David Schwartz 2002-04-29 10:03 ` Terje Eggestad 0 siblings, 1 reply; 54+ messages in thread
From: David Schwartz @ 2002-04-29 8:44 UTC (permalink / raw)
To: terje.eggestad; +Cc: linux-kernel

> > 1) The two instances are cooperating closely together and should be
> > sharing a socket (not each opening one), or
> >
> > 2) The two instances are not cooperating closely together and each
> > own their own socket. For all the kernel knows, they don't even
> > know about each other.
> >
> > In the first case, it's logical for whichever one happens to try to
> > read first to get the/a datagram. In the second case, it's logical
> > for the kernel to pick one and give it all the data.
> >
> > DS

> IMHO, in the second case it's logical for the kernel NOT to allow the
> second process to bind to the port at all. Which is in fact what it
> does; that's the normal case. When you set the SO_REUSEADDR flag on
> the socket, you're telling the kernel that we're in case 1).
>
> TJ

NO. When you set SO_REUSEADDR, you are telling the kernel that you
intend to share your port with *someone*, but not with whom. The
kernel has no way to know that two processes that bind to the same UDP
port with SO_REUSEADDR are the two that were intended to cooperate
with each other. For all it knows, one is a foo intended to cooperate
with other foos and the other is a bar intended to cooperate with
other bars.

That's why if you mean to share, you should share the actual socket
descriptor rather than trying to reference the same transport endpoint
with two different sockets.

Of course, in this case you don't even need SO_REUSEADDR/SO_REUSEPORT,
since you only actually open the endpoint once.

DS

^ permalink raw reply	[flat|nested] 54+ messages in thread
* Re: Possible bug with UDP and SO_REUSEADDR. 2002-04-29 8:44 ` David Schwartz @ 2002-04-29 10:03 ` Terje Eggestad 2002-04-29 10:38 ` David Schwartz 0 siblings, 1 reply; 54+ messages in thread
From: Terje Eggestad @ 2002-04-29 10:03 UTC (permalink / raw)
To: David Schwartz; +Cc: linux-kernel

On Mon, 2002-04-29 at 10:44, David Schwartz wrote:
> NO. When you set SO_REUSEADDR, you are telling the kernel that you
> intend to share your port with *someone*, but not with whom. The
> kernel has no way to know that two processes that bind to the same
> UDP port with SO_REUSEADDR are the two that were intended to
> cooperate with each other. For all it knows, one is a foo intended to
> cooperate with other foos and the other is a bar intended to
> cooperate with other bars.
>
> That's why if you mean to share, you should share the actual socket
> descriptor rather than trying to reference the same transport
> endpoint with two different sockets.
>
> Of course, in this case you don't even need
> SO_REUSEADDR/SO_REUSEPORT, since you only actually open the endpoint
> once.

Well, first of all, I picked up "Unix Network Programming, Networking
APIs: Sockets and XTI" by R. Stevens. This is discussed on p. 195-196
(with reference to "TCP/IP Illustrated" Vol 2, p. 777-779, which I
don't have at hand).

According to Stevens, duplicate binding to the same address (IP +
port) is a multicast/broadcast feature, and the test code I published
here a few mails ago is actually illegal on hosts that
a) don't implement multicast, or
b) implement SO_REUSEPORT (which Linux as of now doesn't).

FYI: In b), using SO_REUSEPORT to bind duplicate addresses works the
same way SO_REUSEADDR does now: all parties must set the flag.

Stevens further remarks that when a unicast datagram is received on
the port, only one socket shall receive it, *** and which one is
implementation specific. ***!!!

*** So the current implementation is NOT a bug. ***
(If you believe Stevens, that is :-) I do.)

I even agree that the *proper* way for two or more programs to share a
UDP port is to share the socket; it just creates an issue of who shall
create the AF_UNIX socket used to pass the descriptor, and of what
happens when the owner of the AF_UNIX socket dies (the others will,
after all, most likely continue). Not to mention the extra code needed
in the programs to implement the descriptor-passing scheme.

However, I still can't see any *practical* use of having one program
(me) bind the port and deliberately share it, and another program
(you) coming along, wanting to share it, and then having all unicast
datagrams passed to you. If I haven't subscribed to any multicast
addresses, and no one is sending broadcasts, there is no point in me
being alive.

Can you come up with a real-life situation where this makes sense?

Like I said, it's currently not a bug, and IMHO any behavior should
only be changed iff SO_REUSEPORT is implemented.

TJ

--
_________________________________________________________________________
Terje Eggestad                 mailto:terje.eggestad@scali.no
Scali Scalable Linux Systems   http://www.scali.com
Olaf Helsets Vei 6             tel: +47 22 62 89 61 (OFFICE)
P.O.Box 150, Oppsal                 +47 975 31 574  (MOBILE)
N-0619 Oslo                    fax: +47 22 62 89 51
NORWAY
_________________________________________________________________________

^ permalink raw reply	[flat|nested] 54+ messages in thread
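For reference, the descriptor passing mentioned above is done with an
SCM_RIGHTS control message over an AF_UNIX socket. A minimal sketch of
the sending side, with illustrative names and error handling omitted:

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Pass fd_to_share to the peer on the other end of unix_sock; the
 * kernel installs a duplicate of the descriptor in the receiver, so
 * both processes then share one transport endpoint. */
int send_fd(int unix_sock, int fd_to_share)
{
	struct msghdr msg;
	struct iovec iov;
	struct cmsghdr *cmsg;
	char cbuf[CMSG_SPACE(sizeof(int))];
	char dummy = 0;

	memset(&msg, 0, sizeof(msg));
	memset(cbuf, 0, sizeof(cbuf));

	iov.iov_base = &dummy;		/* at least one data byte is needed */
	iov.iov_len = 1;
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = cbuf;
	msg.msg_controllen = sizeof(cbuf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd_to_share, sizeof(int));

	return sendmsg(unix_sock, &msg, 0) == 1 ? 0 : -1;
}

The ownership problem Terje raises is real: the process that holds the
listening AF_UNIX socket becomes a single point of failure for handing
out the shared descriptor.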
* Re: Possible bug with UDP and SO_REUSEADDR. 2002-04-29 10:03 ` Terje Eggestad @ 2002-04-29 10:38 ` David Schwartz 2002-04-29 14:20 ` Terje Eggestad 0 siblings, 1 reply; 54+ messages in thread
From: David Schwartz @ 2002-04-29 10:38 UTC (permalink / raw)
To: terje.eggestad; +Cc: linux-kernel

> However, I still can't see any *practical* use of having one program
> (me) bind the port and deliberately share it, and another program
> (you) coming along, wanting to share it, and then having all unicast
> datagrams passed to you. If I haven't subscribed to any multicast
> addresses, and no one is sending broadcasts, there is no point in me
> being alive.
>
> Can you come up with a real-life situation where this makes sense?

Absolutely. This is actually used in cases where you have a 'default'
handler for a protocol built into a larger program, but want to keep
the option to 'override' it with a program with more sophisticated
behavior from time to time. In this case, the last socket should get
all the data until it goes away.

DS

^ permalink raw reply	[flat|nested] 54+ messages in thread
* Re: Possible bug with UDP and SO_REUSEADDR. 2002-04-29 10:38 ` David Schwartz @ 2002-04-29 14:20 ` Terje Eggestad 0 siblings, 0 replies; 54+ messages in thread
From: Terje Eggestad @ 2002-04-29 14:20 UTC (permalink / raw)
To: David Schwartz; +Cc: linux-kernel

On Mon, 2002-04-29 at 12:38, David Schwartz wrote:
> > However, I still can't see any *practical* use of having one
> > program (me) bind the port and deliberately share it, and another
> > program (you) coming along, wanting to share it, and then having
> > all unicast datagrams passed to you. If I haven't subscribed to any
> > multicast addresses, and no one is sending broadcasts, there is no
> > point in me being alive.
> >
> > Can you come up with a real-life situation where this makes sense?
>
> Absolutely. This is actually used in cases where you have a 'default'
> handler for a protocol built into a larger program, but want to keep
> the option to 'override' it with a program with more sophisticated
> behavior from time to time. In this case, the last socket should get
> all the data until it goes away.
>
> DS

First of all, since we're in agreement that the current behavior is
NOT a bug, this discussion is pretty pointless; however, I'm getting
worked up.

In all fairness, I have a colleague who did an implementation of
TCP/IP a decade ago, and he agrees that the current logic is the way
implementations have worked. Thus we're less likely to break things by
leaving them the way they are.

However, your logic is broken. First of all, I asked for a case where
it makes sense, not where it has moronically been done so. If you
review your own argument:

> That's why if you mean to share, you should share the actual socket
> descriptor rather than trying to reference the same transport
> endpoint with two different sockets.

then the program that wants to "override" should connect to the first
one over an AF_UNIX socket and get the descriptor, and the first one
should be told not to read from the UDP socket until the AF_UNIX
socket to the overrider is broken/disconnected. Since, according to
Stevens, what happens here is implementation specific, the
"overriding" you describe is non-portable.

If you look at your other argument:

> NO. When you set SO_REUSEADDR, you are telling the kernel that you
> intend to share your port with *someone*, but not with whom. The
> kernel has no way to know that two processes that bind to the same
> UDP port with SO_REUSEADDR are the two that were intended to
> cooperate with each other. For all it knows, one is a foo intended to
> cooperate with other foos and the other is a bar intended to
> cooperate with other bars.

the logical deduction is that you should never, ever bind to the same
address for unicast, since the kernel doesn't have sufficient
information to route the datagram correctly.

I *COULD* agree that it should be illegal to duplicate-bind to an
address. Trouble is, right now it is actually legal...

TJ

--
_________________________________________________________________________
Terje Eggestad                 mailto:terje.eggestad@scali.no
Scali Scalable Linux Systems   http://www.scali.com
Olaf Helsets Vei 6             tel: +47 22 62 89 61 (OFFICE)
P.O.Box 150, Oppsal                 +47 975 31 574  (MOBILE)
N-0619 Oslo                    fax: +47 22 62 89 51
NORWAY
_________________________________________________________________________

^ permalink raw reply	[flat|nested] 54+ messages in thread
[parent not found: <200204192128.QAA24592@popmail.austin.ibm.com>]
* Re: [PATCH] zerocopy NFS updated [not found] ` <200204192128.QAA24592@popmail.austin.ibm.com> @ 2002-04-20 10:14 ` Hirokazu Takahashi 2002-04-20 15:49 ` Andrew Theurer 0 siblings, 1 reply; 54+ messages in thread
From: Hirokazu Takahashi @ 2002-04-20 10:14 UTC (permalink / raw)
To: habanero; +Cc: trond.myklebust, linux-kernel

Hi,

> With all this talk of serialization on UDP, I have a question. First,
> let me explain the situation. I have an NFS test which calls 48
> clients to read the same 200 MB file on the same server. I record the
> time for all the clients to finish and then calculate the total
> throughput. The server is a 4-way IA32. (I used this test to measure
> the zerocopy/tcp/nfs patch.) Now, right before the test, the 200 MB
> file is created on the server, so there is no disk IO at all during
> the test. It's just a very simple cached read. Now, when the clients
> use udp, I can only get a run queue length of 1, and I have confirmed
> there is only one nfsd thread in svc_process() at one time, and I am
> 65% idle. With tcp, I can get all nfsd threads running, and max all
> CPUs. Am I experiencing a bottleneck/serialization due to a single
> UDP socket?

Which version do you use? The 2.5.8 kernel has a problem with
readahead in NFSD: it doesn't work at all. It should be easy to fix.

Thank you,
Hirokazu Takahashi.

^ permalink raw reply	[flat|nested] 54+ messages in thread
* Re: [PATCH] zerocopy NFS updated 2002-04-20 10:14 ` [PATCH] zerocopy NFS updated Hirokazu Takahashi @ 2002-04-20 15:49 ` Andrew Theurer 0 siblings, 0 replies; 54+ messages in thread
From: Andrew Theurer @ 2002-04-20 15:49 UTC (permalink / raw)
To: Hirokazu Takahashi; +Cc: trond.myklebust, linux-kernel

> Hi,
>
> > With all this talk of serialization on UDP, I have a question.
> > First, let me explain the situation. I have an NFS test which calls
> > 48 clients to read the same 200 MB file on the same server. I
> > record the time for all the clients to finish and then calculate
> > the total throughput. The server is a 4-way IA32. (I used this test
> > to measure the zerocopy/tcp/nfs patch.) Now, right before the test,
> > the 200 MB file is created on the server, so there is no disk IO at
> > all during the test. It's just a very simple cached read. Now, when
> > the clients use udp, I can only get a run queue length of 1, and I
> > have confirmed there is only one nfsd thread in svc_process() at
> > one time, and I am 65% idle. With tcp, I can get all nfsd threads
> > running, and max all CPUs. Am I experiencing a
> > bottleneck/serialization due to a single UDP socket?
>
> Which version do you use? The 2.5.8 kernel has a problem with
> readahead in NFSD: it doesn't work at all.

I have this problem on every version I have used, including 2.4.18,
2.4.18 with Neil's patches, 2.5.6, and 2.5.7. One other thing I forgot
to mention: if I set the number of resident nfsd threads to "2", I can
get 2 nfsd threads running at once (nfsd_busy = 2), along with a ~30%
improvement in throughput. If I use any other quantity of resident
nfsd threads, I always get exactly 1 nfsd thread running (nfsd_busy =
1) during this test. With tcp there is no serialization at all; I can
get nearly 48 nfsd threads busy with the 48 clients all reading at
once.

-Andrew

^ permalink raw reply	[flat|nested] 54+ messages in thread
Thread overview: 54+ messages
[not found] <20020410.190550.83626375.taka@valinux.co.jp.suse.lists.linux.kernel>
2002-04-10 19:32 ` [PATCH] zerocopy NFS updated Andi Kleen
2002-04-11 2:30 ` David S. Miller
2002-04-11 6:46 ` Hirokazu Takahashi
2002-04-11 6:48 ` David S. Miller
2002-04-11 7:41 ` Hirokazu Takahashi
2002-04-11 7:52 ` David S. Miller
2002-04-11 11:38 ` Hirokazu Takahashi
2002-04-11 11:36 ` David S. Miller
2002-04-11 18:00 ` Denis Vlasenko
2002-04-11 13:16 ` Andi Kleen
2002-04-11 17:36 ` Benjamin LaHaise
2002-04-16 0:17 ` Mike Fedyk
2002-04-16 15:37 ` Oliver Xymoron
2002-04-11 17:33 ` Benjamin LaHaise
2002-04-12 8:10 ` Hirokazu Takahashi
2002-04-12 12:30 ` Hirokazu Takahashi
2002-04-12 12:35 ` Andi Kleen
2002-04-12 21:22 ` Jamie Lokier
2002-04-12 21:31 ` David S. Miller
2002-04-13 0:21 ` Jamie Lokier
2002-04-13 6:39 ` Andi Kleen
2002-04-13 8:01 ` Hirokazu Takahashi
2002-04-13 19:19 ` Eric W. Biederman
2002-04-13 19:37 ` Andi Kleen
2002-04-13 20:34 ` Eric W. Biederman
2002-04-24 23:11 ` Mike Fedyk
2002-04-25 17:11 ` Andreas Dilger
2002-04-13 18:52 ` Chris Wedgwood
2002-04-14 0:07 ` Keith Owens
2002-04-14 8:19 ` Chris Wedgwood
2002-04-14 8:40 ` Keith Owens
2002-04-12 21:39 ` David S. Miller
2002-04-15 1:30 ` Hirokazu Takahashi
2002-04-15 4:23 ` David S. Miller
2002-04-16 1:03 ` Hirokazu Takahashi
2002-04-16 1:41 ` Jakob Østergaard
2002-04-16 2:20 ` Hirokazu Takahashi
2002-04-18 5:01 ` Hirokazu Takahashi
2002-04-18 7:58 ` Jakob Østergaard
2002-04-18 8:53 ` Trond Myklebust
2002-04-19 3:21 ` Hirokazu Takahashi
2002-04-19 9:18 ` Trond Myklebust
2002-04-20 7:47 ` Hirokazu Takahashi
2002-04-25 12:37 ` Possible bug with UDP and SO_REUSEADDR. Was " Terje Eggestad
2002-04-26 2:43 ` David S. Miller
2002-04-26 7:38 ` Terje Eggestad
2002-04-29 0:41 ` Possible bug with UDP and SO_REUSEADDR David Schwartz
2002-04-29 8:06 ` Terje Eggestad
2002-04-29 8:44 ` David Schwartz
2002-04-29 10:03 ` Terje Eggestad
2002-04-29 10:38 ` David Schwartz
2002-04-29 14:20 ` Terje Eggestad
[not found] ` <200204192128.QAA24592@popmail.austin.ibm.com>
2002-04-20 10:14 ` [PATCH] zerocopy NFS updated Hirokazu Takahashi
2002-04-20 15:49 ` Andrew Theurer