* tuning linux for high network performance? @ 2002-10-23 10:18 Roy Sigurd Karlsbakk 2002-10-23 11:06 ` [RESEND] " Roy Sigurd Karlsbakk 0 siblings, 1 reply; 36+ messages in thread From: Roy Sigurd Karlsbakk @ 2002-10-23 10:18 UTC (permalink / raw) To: netdev; +Cc: Kernel mailing list hi I've got this video server serving video for VoD. problem is the P4 1.8 seems to be maxed out by a few system calls. The below output is for ~50 clients streaming at ~4.5Mbps. if trying to increase this to ~70, the CPU maxes out. Does anyone have an idea? bash-2.05# readprofile | sort -rn +2 | head -30 154203 default_idle 2409.4219 212723 csum_partial_copy_generic 916.9095 100164 handle_IRQ_event 695.5833 24979 system_call 390.2969 37300 e1000_intr 388.5417 119699 ide_intr 340.0540 30598 skb_release_data 273.1964 40740 do_softirq 195.8654 131818 do_wp_page 164.7725 9935 fget 155.2344 24747 kfree 154.6687 10911 del_timer 113.6562 11683 ip_conntrack_find_get 91.2734 4120 sock_poll 85.8333 9357 ip_ct_find_proto 83.5446 5194 sock_wfree 81.1562 4929 add_wait_queue 77.0156 8361 flush_tlb_page 74.6518 4571 remove_wait_queue 71.4219 2191 __brelse 68.4688 29477 skb_clone 68.2338 8562 do_gettimeofday 59.4583 5673 process_timeout 59.0938 11097 tcp_v4_send_check 57.7969 6124 kfree_skbmem 54.6786 17115 tcp_poll 53.4844 21130 nf_hook_slow 52.8250 8299 ip_ct_refresh 51.8687 15429 __kfree_skb 50.7533 1059 lru_cache_del 46.0435 roy -- Roy Sigurd Karlsbakk, Datavaktmester ProntoTV AS - http://www.pronto.tv/ Tel: +47 9801 3356 Computers are like air conditioners. They stop working when you open Windows. ^ permalink raw reply [flat|nested] 36+ messages in thread
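For context: csum_partial_copy_generic is what the kernel runs when it copies socket data between user space and kernel buffers and checksums it in the same pass, so on the transmit side it gets charged for every byte the server write()s or send()s. A minimal sketch of the kind of per-client loop such a streamer runs (the descriptors, buffer size and error handling are illustrative, not taken from Roy's actual server):

    /* A per-client streaming step (sketch only): read a chunk from the
     * file and push it to the client socket.  Each send() makes the
     * kernel copy the buffer from user space into socket buffers and
     * checksum it on the fly -- that combined copy+checksum is
     * csum_partial_copy_generic in the profile above. */
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/socket.h>

    static int stream_chunk(int file_fd, int sock_fd, char *buf, size_t bufsz)
    {
        ssize_t n, w, off = 0;

        n = read(file_fd, buf, bufsz);        /* disk/page cache -> user buffer */
        if (n <= 0)
            return (int)n;                    /* 0 = EOF, <0 = error */
        while (off < n) {                     /* user buffer -> kernel socket buffers */
            w = send(sock_fd, buf + off, (size_t)(n - off), 0);
            if (w < 0)
                return -1;
            off += w;
        }
        return 1;
    }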
* [RESEND] tuning linux for high network performance? 2002-10-23 10:18 tuning linux for high network performance? Roy Sigurd Karlsbakk @ 2002-10-23 11:06 ` Roy Sigurd Karlsbakk 2002-10-23 13:01 ` bert hubert 2002-10-23 18:01 ` [RESEND] tuning linux for high network performance? Denis Vlasenko 0 siblings, 2 replies; 36+ messages in thread From: Roy Sigurd Karlsbakk @ 2002-10-23 11:06 UTC (permalink / raw) To: netdev; +Cc: Kernel mailing list > I've got this video server serving video for VoD. problem is the P4 1.8 > seems to be maxed out by a few system calls. The below output is for ~50 > clients streaming at ~4.5Mbps. if trying to increase this to ~70, the CPU > maxes out. > > Does anyone have an idea? ...adding the whole profile output - sorted by the first column this time... 905182 total 0.4741 121426 csum_partial_copy_generic 474.3203 93633 default_idle 1800.6346 74665 do_wp_page 111.1086 65857 ide_intr 184.9916 53636 handle_IRQ_event 432.5484 21973 do_softirq 107.7108 20498 e1000_intr 244.0238 19800 do_page_fault 16.8081 19395 skb_clone 45.7429 14564 system_call 260.0714 13592 kfree 89.4211 13557 skb_release_data 116.8707 13025 ide_do_request 17.6970 12988 do_rw_disk 8.4557 11841 tcp_sendmsg 2.6814 11720 nf_hook_slow 29.0099 11712 tcp_poll 34.0465 10688 schedule 7.8588 10386 __kfree_skb 34.1645 10052 ipt_do_table 10.1741 8286 fget 115.0833 7436 tcp_v4_send_check 44.2619 7191 e1000_clean_tx_irq 16.6458 7031 kmalloc 18.1211 6610 tcp_write_xmit 9.3892 6241 tcp_clean_rtx_queue 8.0425 6232 ip_conntrack_find_get 51.9333 6140 ide_dmaproc 8.4341 6125 tcp_packet 14.0482 5858 qdisc_restart 15.4158 5734 e1000_xmit_frame 5.6660 5709 tcp_v4_rcv 3.7363 5703 sys_rt_sigprocmask 11.4060 5445 tcp_transmit_skb 3.7500 5273 alloc_skb 11.8761 4961 ide_wait_stat 18.7917 4790 ip_ct_find_proto 44.3519 4782 add_timer 18.3923 4760 ip_ct_refresh 29.7500 4729 do_anonymous_page 17.6455 4616 e1000_clean_rx_irq 4.9106 4464 do_gettimeofday 37.2000 4359 flush_tlb_page 38.9196 4209 ip_finish_output2 16.4414 3731 get_hash_table 23.3188 3714 eth_type_trans 21.1023 3712 __make_request 2.3375 3680 __ip_conntrack_find 12.7778 3480 ip_route_input 9.1579 3363 kfree_skbmem 32.3365 3295 __switch_to 15.2546 3205 fput 13.1352 3143 rmqueue 5.3452 3137 ip_conntrack_in 5.0272 3008 sync_timers 250.6667 2861 sock_wfree 47.6833 2580 ip_queue_xmit 2.0347 2578 process_timeout 26.8542 2577 netif_rx 6.1357 2555 get_user_pages 6.2623 2504 sock_poll 62.6000 2346 ide_build_sglist 5.6394 2316 brw_kiovec 2.5619 2256 csum_partial 7.8333 2251 ip_queue_xmit2 4.2958 2198 start_request 4.0404 2186 dev_queue_xmit 2.7462 2167 timer_bh 2.2203 2162 __free_pages_ok 3.1608 2157 zap_page_range 2.4400 1942 mark_dirty_kiobuf 21.1087 1733 process_backlog 5.9349 1719 tcp_rcv_established 0.8493 1689 add_wait_queue 32.4808 1650 mod_timer 6.1567 1603 wait_kio 17.4239 1575 net_rx_action 4.8611 1554 get_pid 4.0052 1434 lru_cache_add 12.8036 1429 handle_mm_fault 7.7663 1397 ip_local_deliver_finish 4.5357 1357 nf_iterate 10.2803 1350 e1000_alloc_rx_buffers 5.2734 1298 do_select 2.5155 1268 unlock_page 12.1923 1209 submit_bh 10.7946 1184 add_entropy_words 5.9200 1175 __brelse 36.7188 1125 __pollwait 7.8125 1108 shrink_list 5.7708 1099 generic_make_request 3.6151 1080 __free_pages 33.7500 1052 tcp_ack 1.2524 1020 ip_rcv 1.0851 986 raid0_make_request 2.9345 898 ext3_direct_io_get_block 4.7766 883 pfifo_fast_dequeue 11.6184 863 sys_gettimeofday 5.5321 828 tcp_ack_update_window 3.9808 813 ipt_local_out_hook 7.8173 761 __lru_cache_del 6.5603 756 sys_write 2.9531 742 
__rdtsc_delay 26.5000 730 uhci_interrupt 3.3182 718 net_tx_action 2.5643 710 batch_entropy_store 3.9444 701 add_timer_randomness 3.3066 666 tasklet_hi_action 4.1625 662 sys_nanosleep 1.7796 635 set_page_dirty 5.4741 627 __tcp_data_snd_check 3.1350 611 netif_receive_skb 1.9094 601 pfifo_fast_enqueue 5.3661 590 del_timer_sync 4.3382 587 lru_cache_del 26.6818 574 get_unmapped_area 2.1103 561 wait_for_tcp_memory 0.7835 557 ip_refrag 5.8021 546 ip_conntrack_local 6.2045 515 sys_select 0.4486 507 __tcp_select_window 2.2634 481 ext3_get_branch 2.2689 433 ip_output 1.2301 389 ip_confirm 9.7250 384 find_vma 4.5714 379 set_bh_page 9.4750 376 tcp_v4_do_rcv 1.0108 370 tcp_ack_no_tstamp 1.4453 368 batch_entropy_process 2.0000 365 ide_build_dmatable 1.1551 364 ip_rcv_finish 0.7696 363 kmem_cache_free 2.8359 362 __wake_up 1.8854 338 ext3_get_block_handle 0.5152 336 inet_sendmsg 5.2500 318 bh_action 2.3382 298 tcp_data_queue 0.1064 272 md_make_request 2.6154 272 ext3_block_to_path 0.9714 248 sock_sendmsg 1.8235 244 __alloc_pages 0.6932 238 kmem_cache_alloc 0.7532 226 __free_pte 3.1389 225 tcp_ack_probe 1.3720 224 __run_task_queue 2.0741 213 ide_get_queue 5.3250 209 __ip_ct_find_proto 4.3542 202 get_sample_stats 1.7414 195 tcp_write_space 1.6810 185 schedule_timeout 1.1859 184 do_signal 0.2992 182 ipt_hook 4.5500 171 generic_direct_IO 0.5552 168 can_share_swap_page 1.8261 157 ip_local_deliver 0.3965 155 get_conntrack_index 2.7679 145 tcp_push_one 0.5664 145 ide_set_handler 1.2500 144 pipe_poll 1.4400 143 max_select_fd 0.8938 139 pte_alloc 0.5792 138 del_timer 1.6429 136 sock_write 0.7234 131 poll_freewait 1.9265 131 getblk 1.7237 130 send_sig_info 0.8553 128 __release_sock 1.4545 125 ret_from_sys_call 7.3529 120 ext3_direct_IO 0.1807 118 tcp_pkt_to_tuple 3.6875 117 find_vma_prev 0.6648 114 do_no_page 0.2298 112 tqueue_bh 4.0000 112 follow_page 1.0769 110 bread 1.1000 108 e1000_rx_checksum 1.2273 107 generic_file_direct_IO 0.1938 102 add_interrupt_randomness 2.5500 97 remove_wait_queue 1.7321 96 mark_page_accessed 2.0000 91 kill_something_info 0.2645 85 invert_tuple 1.9318 81 exit_notify 0.1151 81 cpu_idle 0.9643 80 tcp_new_space 0.6061 79 nf_register_queue_handler 0.5197 75 uhci_remove_pending_qhs 0.3906 69 pdc202xx_dmaproc 0.1250 68 sys_read 0.2656 68 nf_reinject 0.1491 66 map_user_kiobuf 0.2619 65 find_vma_prepare 0.6500 64 generic_file_read 0.2353 61 check_pgt_cache 2.5417 60 free_pages 1.8750 58 error_code 0.9667 57 vm_enough_memory 0.5481 56 __delay 1.4000 55 __const_udelay 1.0577 53 tcp_ioctl 0.0908 53 journal_commit_transaction 0.0132 53 do_munmap 0.0901 52 _alloc_pages 2.1667 51 uhci_finish_completion 0.4554 51 credit_entropy_store 1.1591 50 rh_report_status 0.1953 50 free_page_and_swap_cache 0.8929 49 sys_rt_sigsuspend 0.1750 49 nr_free_pages 0.6125 49 do_mmap_pgoff 0.0394 48 e1000_update_stats 0.0307 48 do_get_write_access 0.0366 48 __journal_file_buffer 0.0916 48 __get_free_pages 2.0000 48 .text.lock.e1000_main 1.7143 47 expand_kiobuf 0.3092 46 uhci_free_pending_qhs 0.4600 46 tcp_parse_options 0.0833 46 kmem_cache_size 5.7500 45 rb_erase 0.2083 44 unmap_kiobuf 0.6111 41 tcp_cwnd_application_limited 0.3106 41 rh_int_timer_do 0.1165 41 init_or_cleanup 0.1424 40 sync_unlocked_inodes 0.0901 40 init_buffer 1.4286 39 .text.lock.ip_input 1.0000 38 vma_merge 0.1301 38 pfifo_fast_requeue 0.6786 38 ip_conntrack_get 0.9500 38 dev_watchdog 0.2209 37 .text.lock.ip_output 0.2803 36 do_check_pgt_cache 0.1731 35 tcp_retrans_try_collapse 0.0576 35 journal_add_journal_head 0.1306 34 ext3_get_inode_loc 0.0914 33 
journal_write_revoke_records 0.1964 32 fsync_buffers_list 0.0860 31 filemap_fdatasync 0.1615 31 __pmd_alloc 1.5500 30 sys_wait4 0.0305 30 restore_sigcontext 0.0949 29 sys_sigreturn 0.1169 28 tcp_fastretrans_alert 0.0224 28 do_settimeofday 0.1628 28 do_ide_request 1.4000 27 unmap_fixup 0.0785 27 find_extend_vma 0.1350 27 eth_header_parse 0.8438 27 current_capacity 0.6750 26 save_i387 0.0478 26 __journal_clean_checkpoint_list 0.2407 25 update_atime 0.3125 25 tcp_v4_destroy_sock 0.0718 25 link_path_walk 0.0102 25 buffer_insert_inode_queue 0.2841 25 __journal_unfile_buffer 0.0665 24 sys_mmap2 0.1622 24 rh_send_irq 0.0896 24 rb_insert_color 0.1224 24 ext3_do_update_inode 0.0261 24 balance_dirty_state 0.3158 24 add_wait_queue_exclusive 0.4615 24 __try_to_free_cp_buf 0.4000 23 free_kiobuf_bhs 0.2396 22 tcp_rcv_synsent_state_process 0.0169 22 sys_munmap 0.2619 22 start_this_handle 0.0598 22 sock_rfree 1.3750 22 setup_sigcontext 0.0743 22 flush_tlb_mm 0.1964 22 do_exit 0.0301 22 alloc_kiobuf_bhs 0.1170 22 __rb_erase_color 0.0567 21 tcp_mem_schedule 0.0477 21 setup_frame 0.0482 21 __generic_copy_to_user 0.3500 20 unlock_buffer 0.3125 20 journal_write_metadata_buffer 0.0240 20 d_lookup 0.0704 20 copy_skb_header 0.0980 19 sync_old_buffers 0.1218 19 sock_mmap 0.4750 19 skb_split 0.0344 19 select_bits_alloc 0.7917 19 get_info_ptr 0.2065 17 tcp_write_wakeup 0.0363 17 ret_from_exception 0.6800 17 kiobuf_wait_for_io 0.1062 17 journal_unlock_journal_head 0.1518 17 bad_signal 0.1250 16 tcp_probe_timer 0.0952 16 tcp_close 0.0083 16 ip_route_output_slow 0.0099 16 __mark_inode_dirty 0.0952 16 .text.lock.timer 0.1250 16 .text.lock.tcp 0.0152 15 journal_cancel_revoke 0.0765 15 ext3_bmap 0.1500 15 do_fork 0.0074 15 blk_grow_request_list 0.0833 14 tcp_v4_conn_request 0.0145 14 sync_supers 0.0507 14 log_start_commit 0.0946 14 lock_vma_mappings 0.3500 14 journal_dirty_metadata 0.0354 14 file_read_actor 0.0625 14 __insert_vm_struct 0.1400 13 tcp_time_to_recover 0.0290 13 sys_ioctl 0.0259 13 lookup_swap_cache 0.1625 13 ip_build_xmit_slow 0.0099 13 invalidate_inode_pages 0.0739 13 ext3_dirty_inode 0.0478 13 bmap 0.2955 12 tcp_collapse 0.0143 12 sys_socketcall 0.0234 12 put_filp 0.1364 12 make_pages_present 0.0968 12 journal_get_write_access 0.1304 12 generic_file_write 0.0061 12 e1000_ioctl 0.3333 11 uhci_transfer_result 0.0316 11 tcp_try_to_open 0.0348 11 tcp_recvmsg 0.0045 11 tcp_create_openreq_child 0.0092 11 sys_kill 0.1250 11 schedule_tail 0.0786 11 osync_buffers_list 0.0859 11 journal_stop 0.0255 11 do_sigpending 0.0887 10 tcp_unhash 0.0397 10 tcp_send_probe0 0.0424 10 tcp_rcv_state_process 0.0040 10 sys_poll 0.0138 10 inet_shutdown 0.0208 10 execute_drive_cmd 0.0221 10 __put_unused_buffer_head 0.1136 9 tcp_write_timer 0.0395 9 tcp_send_skb 0.0191 9 tcp_make_synack 0.0082 9 set_buffer_flushtime 0.4500 9 raid0_status 0.2045 9 copy_page_range 0.0205 8 kupdate 0.0274 8 journal_get_descriptor_buffer 0.0741 8 get_empty_filp 0.0253 8 ext3_write_super 0.0741 8 count_active_tasks 0.1111 8 atomic_dec_and_lock 0.1111 8 __lock_page 0.0400 8 __journal_remove_journal_head 0.0250 8 __ip_conntrack_confirm 0.0115 8 __block_prepare_write 0.0105 7 tcp_invert_tuple 0.2188 7 ports_active 0.1346 7 pipe_write 0.0112 7 kjournald 0.0130 7 handle_signal 0.0273 7 grow_buffers 0.0254 7 ext3_get_block 0.0700 7 balance_classzone 0.0151 7 __jbd_kmalloc 0.0625 7 .text.lock.swap 0.1296 6 vsnprintf 0.0057 6 tcp_v4_send_reset 0.0176 6 tcp_accept 0.0105 6 sleep_on 0.0500 6 select_bits_free 0.3750 6 pipe_read 0.0118 6 number 0.0055 6 
ip_route_output_key 0.0165 6 inet_accept 0.0136 6 get_unused_buffer_head 0.0375 6 dput 0.0176 6 cleanup_rbuf 0.0273 6 __journal_remove_checkpoint 0.0556 6 __journal_drop_transaction 0.0087 6 __find_get_page 0.0938 6 .text.lock.netfilter 0.0260 5 vmtruncate_list 0.0625 5 vfs_permission 0.0208 5 tcp_v4_hnd_req 0.0147 5 tcp_init_cwnd 0.0500 5 tcp_check_urg 0.0158 5 tcp_check_sack_reneging 0.0240 5 sys_fork 0.1786 5 sock_setsockopt 0.0034 5 sock_init_data 0.0161 5 sock_def_readable 0.0521 5 release_x86_irqs 0.0595 5 release_task 0.0109 5 refile_buffer 0.1389 5 pipe_release 0.0368 5 path_init 0.0129 5 nr_free_buffer_pages 0.0625 5 mprotect_fixup 0.0043 5 log_space_left 0.1562 5 ll_rw_block 0.0119 5 journal_start 0.0272 5 init_bh 0.2083 5 get_zeroed_page 0.1389 5 ext3_commit_write 0.0078 5 e1000_tx_timeout 0.2500 5 do_poll 0.0227 5 bdfind 0.1389 5 add_keyboard_randomness 0.1250 5 __wait_on_buffer 0.0338 5 __vma_link 0.0284 5 __tcp_mem_reclaim 0.0595 5 __rb_rotate_left 0.0781 4 write_profile 0.0244 4 tcp_v4_syn_recv_sock 0.0064 4 tcp_v4_search_req 0.0278 4 tcp_v4_route_req 0.0192 4 tcp_v4_init_sock 0.0169 4 tcp_cwnd_restart 0.0263 4 tcp_check_req 0.0043 4 tcp_check_reno_reordering 0.0500 4 sys_mprotect 0.0078 4 strncpy_from_user 0.0500 4 sock_def_wakeup 0.0625 4 sock_alloc 0.0208 4 skb_copy_datagram_iovec 0.0071 4 lookup_mnt 0.0476 4 locks_remove_posix 0.0096 4 invalidate_inode_buffers 0.0370 4 init_conntrack 0.0043 4 halfMD4Transform 0.0068 4 find_or_create_page 0.0164 4 filp_close 0.0238 4 ext3_reserve_inode_write 0.0233 4 ext3_find_goal 0.0213 4 do_fcntl 0.0059 4 dnotify_flush 0.0345 4 d_alloc 0.0105 4 add_blkdev_randomness 0.0526 4 _stext 0.0500 4 __journal_insert_checkpoint 0.0167 4 __find_lock_page_helper 0.0323 4 .text.lock.inode 0.0086 3 wait_for_tcp_connect 0.0054 3 tcp_v4_get_port 0.0045 3 tcp_put_port 0.0150 3 tcp_init_xmit_timers 0.0221 3 tcp_clear_xmit_timers 0.0234 3 tcp_add_reno_sack 0.0357 3 sys_sched_getscheduler 0.0288 3 sys_fcntl64 0.0221 3 sys_accept 0.0119 3 sock_ioctl 0.0268 3 sock_fasync 0.0038 3 sock_def_error_report 0.0312 3 rt_check_expire__thr 0.0077 3 rh_init_int_timer 0.0278 3 reset_hc 0.0167 3 register_gifconf 0.0938 3 read_chan 0.0016 3 put_unused_buffer_head 0.0833 3 pipe_ioctl 0.0375 3 permission 0.0227 3 open_namei 0.0024 3 mm_release 0.0833 3 locks_remove_flock 0.0163 3 ksoftirqd 0.0153 3 journal_file_buffer 0.0682 3 iput 0.0060 3 ip_build_and_send_pkt 0.0067 3 interruptible_sleep_on 0.0250 3 inet_sock_destruct 0.0080 3 inet_ioctl 0.0079 3 inet_create 0.0048 3 immediate_bh 0.1071 3 get_unused_fd 0.0077 3 get_empty_inode 0.0179 3 flush_tlb_all_ipi 0.0395 3 filemap_fdatawait 0.0214 3 fd_install 0.0441 3 ext3_prepare_write 0.0056 3 ext3_mark_iloc_dirty 0.0357 3 e1000_watchdog 0.0064 3 e1000_read_phy_reg 0.0179 3 d_invalidate 0.0214 3 create_buffers 0.0125 3 cp_new_stat64 0.0095 3 copy_mm 0.0040 3 copy_files 0.0043 3 bdget 0.0078 3 __insert_into_lru_list 0.0300 3 __global_restore_flags 0.0417 3 __get_user_4 0.1250 2 write_ldt 0.0037 2 walk_page_buffers 0.0161 2 tcp_try_undo_partial 0.0093 2 tcp_try_undo_dsack 0.0294 2 tcp_send_ack 0.0100 2 tcp_retransmit_skb 0.0034 2 tcp_new 0.0333 2 tcp_init_metrics 0.0063 2 tcp_fragment 0.0029 2 tcp_fixup_sndbuf 0.0455 2 tcp_enter_loss 0.0051 2 tcp_destroy_sock 0.0043 2 tcp_close_state 0.0104 2 tcp_child_process 0.0134 2 tcp_bucket_create 0.0263 2 tasklet_init 0.0500 2 sys_close 0.0179 2 sock_recvmsg 0.0116 2 sock_map_fd 0.0052 2 sk_free 0.0172 2 sk_alloc 0.0208 2 sem_exit 0.0038 2 reschedule 0.1667 2 put_files_struct 0.0109 2 
path_release 0.0417 2 path_lookup 0.0556 2 mmput 0.0172 2 kiobuf_init 0.0238 2 journal_unfile_buffer 0.0556 2 journal_get_undo_access 0.0070 2 journal_dirty_data 0.0047 2 ip_mc_drop_socket 0.0156 2 idedisk_open 0.0156 2 grow_dev_page 0.0122 2 getname 0.0128 2 generic_unplug_device 0.0333 2 generic_file_llseek 0.0135 2 free_kiovec 0.0200 2 flush_signal_handlers 0.0333 2 filemap_nopage 0.0040 2 ext3_writepage_trans_blocks 0.0152 2 ext3_getblk 0.0030 2 do_generic_file_read 0.0017 2 destroy_inode 0.0455 2 deliver_to_old_ones 0.0114 2 copy_namespace 0.0023 2 clear_inode 0.0122 2 clean_inode 0.0109 2 block_prepare_write 0.0179 2 alloc_kiovec 0.0161 2 add_page_to_hash_queue 0.0455 2 activate_page 0.0139 2 __tcp_v4_lookup_listener 0.0208 2 __journal_refile_buffer 0.0088 2 __generic_copy_from_user 0.0227 2 __find_lock_page 0.0500 2 __down_trylock 0.0263 2 __down_failed_trylock 0.1667 2 __block_commit_write 0.0098 2 .text.lock.sched 0.0042 1 vt_console_device 0.0250 1 vgacon_save_screen 0.0114 1 udp_sendmsg 0.0010 1 tty_write 0.0015 1 tty_ioctl 0.0011 1 tcp_xmit_retransmit_queue 0.0010 1 tcp_xmit_probe_skb 0.0086 1 tcp_v4_synq_add 0.0063 1 tcp_v4_rebuild_header 0.0028 1 tcp_timewait_kill 0.0045 1 tcp_sync_mss 0.0081 1 tcp_reset_keepalive_timer 0.0250 1 tcp_reset 0.0039 1 tcp_recv_urg 0.0044 1 tcp_incr_quickack 0.0167 1 tcp_error 0.0139 1 sys_time 0.0119 1 sys_stat64 0.0086 1 sys_modify_ldt 0.0106 1 sys_lstat64 0.0089 1 sys_llseek 0.0034 1 sys_getppid 0.0250 1 sys_getpeername 0.0081 1 sys_fstat64 0.0104 1 sys_clone 0.0250 1 sys_brk 0.0042 1 sys_access 0.0034 1 svc_udp_recvfrom 0.0014 1 sock_wmalloc 0.0125 1 sock_release 0.0104 1 sock_read 0.0064 1 sock_create 0.0036 1 skb_recv_datagram 0.0042 1 show_mem 0.0033 1 setup_rt_frame 0.0015 1 setscheduler 0.0024 1 secure_tcp_sequence_number 0.0051 1 restart_request 0.0132 1 remove_inode_page 0.0192 1 remove_expectations 0.0208 1 proc_pid_lookup 0.0020 1 proc_lookup 0.0068 1 pdc202xx_reset 0.0074 1 path_walk 0.0357 1 opost 0.0023 1 old_mmap 0.0033 1 normal_poll 0.0035 1 nfs3svc_encode_attrstat 0.0020 1 n_tty_receive_buf 0.0002 1 move_addr_to_user 0.0119 1 mm_init 0.0051 1 memory_open 0.0050 1 kmem_cache_grow 0.0018 1 kill_fasync 0.0172 1 journal_free_journal_head 0.0500 1 journal_bmap 0.0089 1 journal_blocks_per_page 0.0312 1 journal_alloc_journal_head 0.0096 1 is_read_only 0.0147 1 ip_ct_gather_frags 0.0031 1 init_private_file 0.0093 1 init_once 0.0038 1 init_buffer_head 0.0182 1 inet_release 0.0125 1 inet_getname 0.0083 1 inet_autobind 0.0023 1 get_pipe_inode 0.0057 1 free_pgtables 0.0071 1 fn_hash_lookup 0.0045 1 find_inlist_lock 0.0035 1 file_move 0.0139 1 fcntl_dirnotify 0.0032 1 ext3_write_inode 0.0192 1 ext3_test_allocatable 0.0156 1 ext3_release_file 0.0357 1 ext3_read_inode 0.0014 1 ext3_open_file 0.0250 1 ext3_group_sparse 0.0104 1 ext3_file_write 0.0053 1 exit_sighand 0.0100 1 e1000_tbi_adjust_stats 0.0021 1 e1000_check_for_link 0.0020 1 do_timer 0.0125 1 do_tcp_sendpages 0.0004 1 do_sys_settimeofday 0.0064 1 do_readv_writev 0.0016 1 do_pollfd 0.0074 1 death_by_timeout 0.0068 1 d_instantiate 0.0139 1 cpu_raise_softirq 0.0154 1 copy_thread 0.0071 1 clear_page_tables 0.0046 1 clean_from_lists 0.0139 1 check_unthrottle 0.0208 1 change_protection 0.0027 1 cached_lookup 0.0119 1 add_to_page_cache_locked 0.0081 1 __user_walk 0.0156 1 __remove_inode_page 0.0104 1 __remove_from_lru_list 0.0119 1 __refile_buffer 0.0109 1 __rb_rotate_right 0.0156 1 __loop_delay 0.0250 1 .text.lock.super 0.0071 -- Roy Sigurd Karlsbakk, Datavaktmester ProntoTV AS - 
http://www.pronto.tv/ Tel: +47 9801 3356 Computers are like air conditioners. They stop working when you open Windows. ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RESEND] tuning linux for high network performance? 2002-10-23 11:06 ` [RESEND] " Roy Sigurd Karlsbakk @ 2002-10-23 13:01 ` bert hubert 2002-10-23 13:21 ` David S. Miller ` (2 more replies) 2002-10-23 18:01 ` [RESEND] tuning linux for high network performance? Denis Vlasenko 1 sibling, 3 replies; 36+ messages in thread From: bert hubert @ 2002-10-23 13:01 UTC (permalink / raw) To: Roy Sigurd Karlsbakk; +Cc: netdev, Kernel mailing list On Wed, Oct 23, 2002 at 01:06:18PM +0200, Roy Sigurd Karlsbakk wrote: > > I've got this video server serving video for VoD. problem is the P4 1.8 > > seems to be maxed out by a few system calls. The below output is for ~50 > > clients streaming at ~4.5Mbps. if trying to increase this to ~70, the CPU > > maxes out. '50 clients *each* streaming at ~4.4MBps', better make that clear, otherwise something is *very* broken. Also mention that you have an e1000 card which does not do outgoing checksumming. You'd think that a kernel would be able to do 250megabits of TCP checksums though. > ...adding the whole profile output - sorted by the first column this time... > > 905182 total 0.4741 > 121426 csum_partial_copy_generic 474.3203 > 93633 default_idle 1800.6346 > 74665 do_wp_page 111.1086 Perhaps the 'copy' also entails grabbing the page from disk, leading to inflated csum_partial_copy_generic stats? Where are you serving from? Regards, bert -- http://www.PowerDNS.com Versatile DNS Software & Services http://lartc.org Linux Advanced Routing & Traffic Control HOWTO ^ permalink raw reply [flat|nested] 36+ messages in thread
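For scale, taking the ~4.5 Mbit/s per-client figure from the original post at face value:

    50 clients x 4.5 Mbit/s ~= 225 Mbit/s ~= 28 MB/s of payload
    70 clients x 4.5 Mbit/s ~= 315 Mbit/s ~= 39 MB/s of payload

Without sendfile(), every one of those bytes is read from disk into a user buffer and then copied (and checksummed) back into the kernel for transmission, on top of the disk-to-RAM and RAM-to-NIC DMA traffic.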
* Re: [RESEND] tuning linux for high network performance? 2002-10-23 13:01 ` bert hubert @ 2002-10-23 13:21 ` David S. Miller 2002-10-23 13:42 ` Roy Sigurd Karlsbakk 2002-10-23 13:41 ` [RESEND] tuning linux for high network performance? Roy Sigurd Karlsbakk 2002-10-23 14:59 ` Nivedita Singhvi 2 siblings, 1 reply; 36+ messages in thread From: David S. Miller @ 2002-10-23 13:21 UTC (permalink / raw) To: bert hubert; +Cc: Roy Sigurd Karlsbakk, netdev, Kernel mailing list On Wed, 2002-10-23 at 06:01, bert hubert wrote: > Also mention that you have an e1000 card which > does not do outgoing checksumming. The e1000 can very well do hardware checksumming on transmit. The missing piece of the puzzle is that his application is not using sendfile(), without which no transmit checksum offload can take place. ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RESEND] tuning linux for high network performance? 2002-10-23 13:21 ` David S. Miller @ 2002-10-23 13:42 ` Roy Sigurd Karlsbakk 2002-10-23 17:01 ` bert hubert 2002-10-24 4:11 ` David S. Miller 0 siblings, 2 replies; 36+ messages in thread From: Roy Sigurd Karlsbakk @ 2002-10-23 13:42 UTC (permalink / raw) To: David S. Miller, bert hubert; +Cc: netdev, Kernel mailing list > The e1000 can very well do hardware checksumming on transmit. > > The missing piece of the puzzle is that his application is not > using sendfile(), without which no transmit checksum offload > can take place. As far as I've understood, sendfile() won't do much good with large files. Is this right? We're talking of 3-6GB files here ... roy -- Roy Sigurd Karlsbakk, Datavaktmester ProntoTV AS - http://www.pronto.tv/ Tel: +47 9801 3356 Computers are like air conditioners. They stop working when you open Windows. ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RESEND] tuning linux for high network performance? 2002-10-23 13:42 ` Roy Sigurd Karlsbakk @ 2002-10-23 17:01 ` bert hubert 2002-10-23 17:10 ` Ben Greear ` (2 more replies) 2002-10-24 4:11 ` David S. Miller 1 sibling, 3 replies; 36+ messages in thread From: bert hubert @ 2002-10-23 17:01 UTC (permalink / raw) To: Roy Sigurd Karlsbakk; +Cc: David S. Miller, netdev, Kernel mailing list On Wed, Oct 23, 2002 at 03:42:48PM +0200, Roy Sigurd Karlsbakk wrote: > > The e1000 can very well do hardware checksumming on transmit. > > > > The missing piece of the puzzle is that his application is not > > using sendfile(), without which no transmit checksum offload > > can take place. > > As far as I've understood, sendfile() won't do much good with large files. Is > this right? I still refuse to believe that a 1.8GHz Pentium4 can only checksum 250megabits/second. MD Raid5 does better and they probably don't use a checksum as braindead as that used by TCP. If the checksumming is not the problem, the copying is, which would be a weakness of your hardware. The function profiled does both the copying and the checksumming. But 250megabits/second also seems low. Dave? Regards, bert -- http://www.PowerDNS.com Versatile DNS Software & Services http://lartc.org Linux Advanced Routing & Traffic Control HOWTO ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RESEND] tuning linux for high network performance? 2002-10-23 17:01 ` bert hubert @ 2002-10-23 17:10 ` Ben Greear 2002-10-23 17:11 ` Richard B. Johnson 2002-10-23 17:12 ` Nivedita Singhvi 2 siblings, 0 replies; 36+ messages in thread From: Ben Greear @ 2002-10-23 17:10 UTC (permalink / raw) To: bert hubert Cc: Roy Sigurd Karlsbakk, David S. Miller, netdev, Kernel mailing list bert hubert wrote: > I still refuse to believe that a 1.8GHz Pentium4 can only checksum > 250megabits/second. MD Raid5 does better and they probably don't use a > checksum as braindead as that used by TCP. For what it's worth, I have been able to send and receive 400+ Mbps of traffic, by directional, on the same machine (ie, about 1600 Mbps of payload across the PCI bus) So, it's probably not the e1000 or networking code that is slowing you down. (This was on a 64/66 PCI, Dual-AMD 2Ghz machine though, are you running only 32/33 PCI? If not, where did you find this motherboard!) Have you tried just reading the information from disk and doing everying except the final 'send/write/sendto' ? That would help determine if it is your file reads that are killing you. Ben -- Ben Greear <greearb@candelatech.com> <Ben_Greear AT excite.com> President of Candela Technologies Inc http://www.candelatech.com ScryMUD: http://scry.wanfear.com http://scry.wanfear.com/~greear ^ permalink raw reply [flat|nested] 36+ messages in thread
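A minimal sketch of the disk-only test Ben suggests, using the same O_DIRECT reads the server does (the 512-byte alignment, 1 MB buffer and command-line file argument are assumptions; the real alignment requirement depends on the device and filesystem):

    /* Read a file with O_DIRECT and throw the data away, to see what the
     * disk path alone costs (sketch only). */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>
    #include <malloc.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        const size_t bufsz = 1 << 20;              /* 1 MB, matching the RAID chunk size */
        char *buf = memalign(512, bufsz);          /* O_DIRECT needs an aligned buffer */
        ssize_t n;
        long long total = 0;
        int fd;

        if (argc < 2 || !buf)
            return 1;
        fd = open(argv[1], O_RDONLY | O_DIRECT);
        if (fd < 0) {
            perror(argv[1]);
            return 1;
        }
        while ((n = read(fd, buf, bufsz)) > 0)     /* read and discard */
            total += n;
        printf("read %lld bytes\n", total);
        return 0;
    }

Running it under time(1), or next to vmstat, shows how much of the CPU budget the disk path alone eats.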
* Re: [RESEND] tuning linux for high network performance? 2002-10-23 17:01 ` bert hubert 2002-10-23 17:10 ` Ben Greear @ 2002-10-23 17:11 ` Richard B. Johnson 2002-10-23 17:12 ` Nivedita Singhvi 2 siblings, 0 replies; 36+ messages in thread From: Richard B. Johnson @ 2002-10-23 17:11 UTC (permalink / raw) To: bert hubert Cc: Roy Sigurd Karlsbakk, David S. Miller, netdev, Kernel mailing list On Wed, 23 Oct 2002, bert hubert wrote: > On Wed, Oct 23, 2002 at 03:42:48PM +0200, Roy Sigurd Karlsbakk wrote: > > > The e1000 can very well do hardware checksumming on transmit. > > > > > > The missing piece of the puzzle is that his application is not > > > using sendfile(), without which no transmit checksum offload > > > can take place. > > > > As far as I've understood, sendfile() won't do much good with large files. Is > > this right? > > I still refuse to believe that a 1.8GHz Pentium4 can only checksum > 250megabits/second. MD Raid5 does better and they probably don't use a > checksum as braindead as that used by TCP. > > If the checksumming is not the problem, the copying is, which would be a > weakness of your hardware. The function profiled does both the copying and > the checksumming. > > But 250megabits/second also seems low. > > Dave? > Ordinary DUAL Pentium 400 MHz machine does this... Calculating CPU speed...done Testing checksum speed...done Testing RAM copy...done Testing I/O port speed...done CPU Clock = 400 MHz checksum speed = 685 Mb/s RAM copy = 1549 Mb/s I/O port speed = 654 kb/s This is 685 megaBYTES per second. checksum speed = 685 Mb/s Cheers, Dick Johnson Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips). Bush : The Fourth Reich of America ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RESEND] tuning linux for high network performance? 2002-10-23 17:01 ` bert hubert 2002-10-23 17:10 ` Ben Greear 2002-10-23 17:11 ` Richard B. Johnson @ 2002-10-23 17:12 ` Nivedita Singhvi 2002-10-23 17:56 ` Richard B. Johnson 2 siblings, 1 reply; 36+ messages in thread From: Nivedita Singhvi @ 2002-10-23 17:12 UTC (permalink / raw) To: bert hubert Cc: Roy Sigurd Karlsbakk, David S. Miller, netdev, Kernel mailing list bert hubert wrote: > I still refuse to believe that a 1.8GHz Pentium4 can only checksum > 250megabits/second. MD Raid5 does better and they probably don't use a > checksum as braindead as that used by TCP. > > If the checksumming is not the problem, the copying is, which would be a > weakness of your hardware. The function profiled does both the copying and > the checksumming. Yep, its not so much the checksumming as the fact that this is done over each byte of data and copied. thanks, Nivedita ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RESEND] tuning linux for high network performance? 2002-10-23 17:12 ` Nivedita Singhvi @ 2002-10-23 17:56 ` Richard B. Johnson 2002-10-23 18:07 ` Nivedita Singhvi 0 siblings, 1 reply; 36+ messages in thread From: Richard B. Johnson @ 2002-10-23 17:56 UTC (permalink / raw) To: Nivedita Singhvi Cc: bert hubert, Roy Sigurd Karlsbakk, David S. Miller, netdev, Kernel mailing list On Wed, 23 Oct 2002, Nivedita Singhvi wrote: > bert hubert wrote: > > > I still refuse to believe that a 1.8GHz Pentium4 can only checksum > > 250megabits/second. MD Raid5 does better and they probably don't use a > > checksum as braindead as that used by TCP. > > > > If the checksumming is not the problem, the copying is, which would be a > > weakness of your hardware. The function profiled does both the copying and > > the checksumming. > > Yep, its not so much the checksumming as the fact that this is > done over each byte of data and copied. > > thanks, > Nivedita No. It's done over each word (short int) and the actual summation takes place during the address calculation of the next word. This gets you a checksum that is practically free. A 400 MHz ix86 CPU will checksum/copy at 685 megabytes per second. It will copy at 1,549 megabytes per second. Those are megaBYTES! If you have slow network performance it has nothing to do with either copy or checksum. Data transmission acts like a low-pass filter. The dominant pole of that transfer function determines the speed, that's why it's called dominant. If you measure a data-rate of 10 megabytes/second, nothing you do with copy or checksum will affect it to any significant extent. If you have a data-rate of 100 megabytes per second, then any tinkering with copy will have an effective improvement ratio of 100/1,549 ~= 0.064. If you have a data rate of 100 megabytes per second and you tinker with checksum, you get an improvement ratio of 100/685 ~= 0.14. These are just not the things that are affecting your performance. If you were to double the checksumming speed, you increase the throughput by 2 * 0.14 = 0.28 with the parameters shown. The TCP/IP checksum is quite nice. It may have been discovered by accident, but it's still nice. It works regardless of whether you have a little endian or big endian machine. It also doesn't wrap so you don't (usually) show a good checksum when the data is bad. It does have the characteristic that if all the bits are inverted, it will checksum good. However, there are not too many real-world scenarios that would result in this inversion. So it's not "brain-dead" as you state. A hardware checksum is really quick because it's really easy. Cheers, Dick Johnson Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips). Bush : The Fourth Reich of America ^ permalink raw reply [flat|nested] 36+ messages in thread
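For reference, the checksum Richard is describing, written as a plain RFC 1071-style C routine rather than the hand-optimized kernel assembly (a sketch; inet_csum is an illustrative name):

    /* 16-bit ones'-complement sum over a buffer (RFC 1071 style). */
    #include <stdint.h>
    #include <stddef.h>

    uint16_t inet_csum(const void *data, size_t len)
    {
        const uint16_t *p = data;
        uint32_t sum = 0;

        while (len > 1) {                     /* sum 16-bit words */
            sum += *p++;
            len -= 2;
        }
        if (len) {                            /* odd trailing byte, padded with zero */
            uint16_t last = 0;
            *(uint8_t *)&last = *(const uint8_t *)p;
            sum += last;
        }
        while (sum >> 16)                     /* fold carries back in (end-around carry) */
            sum = (sum & 0xffff) + (sum >> 16);

        return (uint16_t)~sum;                /* ones'-complement of the sum */
    }

Because carries are folded back in, summing the words in swapped byte order just produces a byte-swapped result, which is the byte-order independence mentioned above.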
* Re: [RESEND] tuning linux for high network performance? 2002-10-23 17:56 ` Richard B. Johnson @ 2002-10-23 18:07 ` Nivedita Singhvi 2002-10-23 18:30 ` Richard B. Johnson 0 siblings, 1 reply; 36+ messages in thread From: Nivedita Singhvi @ 2002-10-23 18:07 UTC (permalink / raw) To: root Cc: bert hubert, Roy Sigurd Karlsbakk, David S. Miller, netdev, Kernel mailing list "Richard B. Johnson" wrote: > No. It's done over each word (short int) and the actual summation > takes place during the address calculation of the next word. This > gets you a checksum that is practically free. Yep, sorry, word, not byte. My bad. The cost is in the fact that this whole process involves loading each word of the data stream into a register. Which is why I also used to consider the checksum cost as negligible. > A 400 MHz ix86 CPU will checksum/copy at 685 megabytes per second. > It will copy at 1,549 megabytes per second. Those are megaBYTES! But then why the difference in the checksum/copy and copy? Are you saying the checksum is not costing you 864 megabytes a second?? thanks, Nivedita ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RESEND] tuning linux for high network performance? 2002-10-23 18:07 ` Nivedita Singhvi @ 2002-10-23 18:30 ` Richard B. Johnson 0 siblings, 0 replies; 36+ messages in thread From: Richard B. Johnson @ 2002-10-23 18:30 UTC (permalink / raw) To: Nivedita Singhvi Cc: bert hubert, Roy Sigurd Karlsbakk, David S. Miller, netdev, Kernel mailing list On Wed, 23 Oct 2002, Nivedita Singhvi wrote: > "Richard B. Johnson" wrote: > > > No. It's done over each word (short int) and the actual summation > > takes place during the address calculation of the next word. This > > gets you a checksum that is practically free. > > Yep, sorry, word, not byte. My bad. The cost is in the fact > that this whole process involves loading each word of the data > stream into a register. Which is why I also used to consider > the checksum cost as negligible. > > > A 400 MHz ix86 CPU will checksum/copy at 685 megabytes per second. > > It will copy at 1,549 megabytes per second. Those are megaBYTES! > > But then why the difference in the checksum/copy and copy? > Are you saying the checksum is not costing you 864 megabytes > a second?? Costing you 864 megabytes per second? Let's say the checksum was free. You would then be able to do INF bytes per second. So it's costing you INF bytes per second? No, it's costing you nothing. If we were not dealing with INF, then 'Cost' is approximately 1/N, not N. Cost is work_done_without_checksum - work_done_with_checksum. Because of the low-pass filter pole, these numbers are practically the same. But you can get a measurable difference between any two large numbers. This makes the 'cost' seem high. You need to make it relative for it to make any sense, so a 'goodness' can be expressed as a ratio of the cost to the work having been done. Cheers, Dick Johnson Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips). Bush : The Fourth Reich of America ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RESEND] tuning linux for high network performance? 2002-10-23 13:42 ` Roy Sigurd Karlsbakk 2002-10-23 17:01 ` bert hubert @ 2002-10-24 4:11 ` David S. Miller 2002-10-24 9:37 ` Karen Shaeffer 2002-10-24 10:30 ` sendfile64() anyone? (was [RESEND] tuning linux for high network performance?) Roy Sigurd Karlsbakk 1 sibling, 2 replies; 36+ messages in thread From: David S. Miller @ 2002-10-24 4:11 UTC (permalink / raw) To: Roy Sigurd Karlsbakk; +Cc: bert hubert, netdev, Kernel mailing list On Wed, 2002-10-23 at 06:42, Roy Sigurd Karlsbakk wrote: > As far as I've understood, sendfile() won't do much good with large files. Is > this right? There is always a benefit to using sendfile(), when you use sendfile() the cpu doesn't touch one byte of the data if the network card support TX checksumming. The disk DMAs to ram, then the net card DMAs from ram. Simple as that. ^ permalink raw reply [flat|nested] 36+ messages in thread
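A minimal sketch of that path, using the 2.4-era prototype quoted later in the thread, ssize_t sendfile(int out_fd, int in_fd, off_t *offset, size_t count) (names and the lack of pacing/error handling are illustrative). With a 32-bit off_t this is also exactly where Roy's multi-gigabyte files become a problem:

    /* Stream a whole file to a connected socket with sendfile() (sketch).
     * The pages go disk -> page cache -> NIC by DMA; with TX checksum
     * offload the CPU never touches the payload. */
    #include <sys/types.h>
    #include <sys/sendfile.h>

    static int stream_file(int sock_fd, int file_fd, off_t file_len)
    {
        off_t offset = 0;

        while (offset < file_len) {
            ssize_t n = sendfile(sock_fd, file_fd, &offset,
                                 (size_t)(file_len - offset));
            if (n <= 0)
                return -1;      /* real code: handle EINTR/EAGAIN and pace the stream */
        }
        return 0;
    }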
* Re: [RESEND] tuning linux for high network performance? 2002-10-24 4:11 ` David S. Miller @ 2002-10-24 9:37 ` Karen Shaeffer 2002-10-24 10:30 ` sendfile64() anyone? (was [RESEND] tuning linux for high network performance?) Roy Sigurd Karlsbakk 1 sibling, 0 replies; 36+ messages in thread From: Karen Shaeffer @ 2002-10-24 9:37 UTC (permalink / raw) To: David S. Miller; +Cc: netdev On Wed, Oct 23, 2002 at 09:11:09PM -0700, David S. Miller wrote: > On Wed, 2002-10-23 at 06:42, Roy Sigurd Karlsbakk wrote: > > As far as I've understood, sendfile() won't do much good with large files. Is > > this right? > > There is always a benefit to using sendfile(), when you use > sendfile() the cpu doesn't touch one byte of the data if > the network card support TX checksumming. The disk DMAs > to ram, then the net card DMAs from ram. Simple as that. Referring to: $ rpm -qf /usr/include/sys/sendfile.h glibc-devel-2.2.5-40 quoting "sendfile.h" #ifdef __USE_FILE_OFFSET64 # error "<sys/sendfile.h> cannot be used with _FILE_OFFSET_BITS=64" #endif So, how does one use sendfile() for large files that are greater than 2 GBytes? Am I missing something? Thanks, Karen -- Karen Shaeffer Neuralscape; Santa Cruz, Ca. 95060 shaeffer@neuralscape.com http://www.neuralscape.com ^ permalink raw reply [flat|nested] 36+ messages in thread
* sendfile64() anyone? (was [RESEND] tuning linux for high network performance?) 2002-10-24 4:11 ` David S. Miller 2002-10-24 9:37 ` Karen Shaeffer @ 2002-10-24 10:30 ` Roy Sigurd Karlsbakk 2002-10-24 10:47 ` David S. Miller 1 sibling, 1 reply; 36+ messages in thread From: Roy Sigurd Karlsbakk @ 2002-10-24 10:30 UTC (permalink / raw) To: David S. Miller; +Cc: bert hubert, netdev, Kernel mailing list On Thursday 24 October 2002 06:11, David S. Miller wrote: > On Wed, 2002-10-23 at 06:42, Roy Sigurd Karlsbakk wrote: > > As far as I've understood, sendfile() won't do much good with large > > files. Is this right? > > There is always a benefit to using sendfile(), when you use > sendfile() the cpu doesn't touch one byte of the data if > the network card support TX checksumming. The disk DMAs > to ram, then the net card DMAs from ram. Simple as that. Are there any plans of implementing sendfile64() or sendfile() support for -D_FILE_OFFSET_BITS=64? (from man 2 sendfile) ssize_t sendfile(int out_fd, int in_fd, off_t *offset, size_t count); int main() { ssize_t s1; size_t count; off_t offset; printf("sizeof ssize_t: %d\n", sizeof s1); printf("sizeof size_t: %d\n", sizeof count); printf("sizeof off_t: %d\n", sizeof offset); return 0; } $ make ... $ ./sendfile_test sizeof ssize_t: 4 sizeof size_t: 4 sizeof off_t: 4 $ and - when attempting to build this with -D_FILE_OFFSET_BITS=64 [roy@roy-sin micro_httpd-O_DIRECT]$ make sendfile_test gcc -D_DEBUG -Wall -W -D_GNU_SOURCE -D_NO_DIR_ACCESS -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -DUSE_O_DIRECT -DINETD -Wno-unused -O0 -ggdb -c sendfile_test.c In file included from sendfile_test.c:1: /usr/include/sys/sendfile.h:26: #error "<sys/sendfile.h> cannot be used with _FILE_OFFSET_BITS=64" make: *** [sendfile_test.o] Error 1 -- Roy Sigurd Karlsbakk, Datavaktmester ProntoTV AS - http://www.pronto.tv/ Tel: +47 9801 3356 Computers are like air conditioners. They stop working when you open Windows. ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: sendfile64() anyone? (was [RESEND] tuning linux for high network performance?) 2002-10-24 10:30 ` sendfile64() anyone? (was [RESEND] tuning linux for high network performance?) Roy Sigurd Karlsbakk @ 2002-10-24 10:47 ` David S. Miller 2002-10-24 11:07 ` Roy Sigurd Karlsbakk 0 siblings, 1 reply; 36+ messages in thread From: David S. Miller @ 2002-10-24 10:47 UTC (permalink / raw) To: Roy Sigurd Karlsbakk; +Cc: bert hubert, netdev, Kernel mailing list On Thu, 2002-10-24 at 03:30, Roy Sigurd Karlsbakk wrote: > Are there any plans of implementing sendfile64() or sendfile() support for > -D_FILE_OFFSET_BITS=64? This is old hat, and appears in every current vendor kernel I am aware of and is in 2.5.x as well. ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: sendfile64() anyone? (was [RESEND] tuning linux for high network performance?) 2002-10-24 10:47 ` David S. Miller @ 2002-10-24 11:07 ` Roy Sigurd Karlsbakk 0 siblings, 0 replies; 36+ messages in thread From: Roy Sigurd Karlsbakk @ 2002-10-24 11:07 UTC (permalink / raw) To: David S. Miller; +Cc: bert hubert, netdev, Kernel mailing list On Thursday 24 October 2002 12:47, David S. Miller wrote: > On Thu, 2002-10-24 at 03:30, Roy Sigurd Karlsbakk wrote: > > Are there any plans of implementing sendfile64() or sendfile() support > > for -D_FILE_OFFSET_BITS=64? > > This is old hat, and appears in every current vendor kernel I am > aware of and is in 2.5.x as well. then where can I find these patches? I cannot use 2.5, and I usually try to stick with an official kernel. and - if this patch has been around all this time... why isn't it in the official kernel yet? -- Roy Sigurd Karlsbakk, Datavaktmester ProntoTV AS - http://www.pronto.tv/ Tel: +47 9801 3356 Computers are like air conditioners. They stop working when you open Windows. ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RESEND] tuning linux for high network performance? 2002-10-23 13:01 ` bert hubert 2002-10-23 13:21 ` David S. Miller @ 2002-10-23 13:41 ` Roy Sigurd Karlsbakk 2002-10-23 14:59 ` Nivedita Singhvi 2 siblings, 0 replies; 36+ messages in thread From: Roy Sigurd Karlsbakk @ 2002-10-23 13:41 UTC (permalink / raw) To: bert hubert; +Cc: netdev, Kernel mailing list > '50 clients *each* streaming at ~4.4MBps', better make that clear, > otherwise something is *very* broken. Also mention that you have an e1000 > card which does not do outgoing checksumming. just to clarify: s/MBps/Mbps/ s/bps/bits per second/ > You'd think that a kernel would be able to do 250megabits of TCP checksums > though. > > > ...adding the whole profile output - sorted by the first column this > > time... > > > > 905182 total 0.4741 > > 121426 csum_partial_copy_generic 474.3203 > > 93633 default_idle 1800.6346 > > 74665 do_wp_page 111.1086 > > Perhaps the 'copy' also entails grabbing the page from disk, leading to > inflated csum_partial_copy_generic stats? I really don't know. Just to clarify a little more - the server app uses O_DIRECT to read the data before tossing it to the socket. > Where are you serving from? What do you mean? roy -- Roy Sigurd Karlsbakk, Datavaktmester ProntoTV AS - http://www.pronto.tv/ Tel: +47 9801 3356 Computers are like air conditioners. They stop working when you open Windows. ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RESEND] tuning linux for high network performance? 2002-10-23 13:01 ` bert hubert 2002-10-23 13:21 ` David S. Miller 2002-10-23 13:41 ` [RESEND] tuning linux for high network performance? Roy Sigurd Karlsbakk @ 2002-10-23 14:59 ` Nivedita Singhvi 2002-10-23 15:26 ` O_DIRECT sockets? (was [RESEND] tuning linux for high network performance?) Roy Sigurd Karlsbakk 2 siblings, 1 reply; 36+ messages in thread From: Nivedita Singhvi @ 2002-10-23 14:59 UTC (permalink / raw) To: bert hubert; +Cc: Roy Sigurd Karlsbakk, netdev, Kernel mailing list bert hubert wrote: > > ...adding the whole profile output - sorted by the first column this time... > > > > 905182 total 0.4741 > > 121426 csum_partial_copy_generic 474.3203 > > 93633 default_idle 1800.6346 > > 74665 do_wp_page 111.1086 > > Perhaps the 'copy' also entails grabbing the page from disk, leading to > inflated csum_partial_copy_generic stats? I think this is strictly a copy from user space->kernel and vice versa. This shouldnt include the disk access etc. thanks, Nivedita ^ permalink raw reply [flat|nested] 36+ messages in thread
* O_DIRECT sockets? (was [RESEND] tuning linux for high network performance?) 2002-10-23 14:59 ` Nivedita Singhvi @ 2002-10-23 15:26 ` Roy Sigurd Karlsbakk 2002-10-23 16:34 ` Nivedita Singhvi 0 siblings, 1 reply; 36+ messages in thread From: Roy Sigurd Karlsbakk @ 2002-10-23 15:26 UTC (permalink / raw) To: Nivedita Singhvi, bert hubert; +Cc: netdev, Kernel mailing list On Wednesday 23 October 2002 16:59, Nivedita Singhvi wrote: > bert hubert wrote: > > > ...adding the whole profile output - sorted by the first column this > > > time... > > > > > > 905182 total 0.4741 > > > 121426 csum_partial_copy_generic 474.3203 > > > 93633 default_idle 1800.6346 > > > 74665 do_wp_page 111.1086 > > > > Perhaps the 'copy' also entails grabbing the page from disk, leading to > > inflated csum_partial_copy_generic stats? > > I think this is strictly a copy from user space->kernel and vice versa. > This shouldnt include the disk access etc. hm I'm doing O_DIRECT read (from disk), so it needs to be user -> kernel, then. any chance of using O_DIRECT to the socket? -- Roy Sigurd Karlsbakk, Datavaktmester ProntoTV AS - http://www.pronto.tv/ Tel: +47 9801 3356 Computers are like air conditioners. They stop working when you open Windows. ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: O_DIRECT sockets? (was [RESEND] tuning linux for high network performance?) 2002-10-23 15:26 ` O_DIRECT sockets? (was [RESEND] tuning linux for high network performance?) Roy Sigurd Karlsbakk @ 2002-10-23 16:34 ` Nivedita Singhvi 2002-10-24 10:14 ` Roy Sigurd Karlsbakk 0 siblings, 1 reply; 36+ messages in thread From: Nivedita Singhvi @ 2002-10-23 16:34 UTC (permalink / raw) To: Roy Sigurd Karlsbakk; +Cc: bert hubert, netdev, Kernel mailing list Roy Sigurd Karlsbakk wrote: > I'm doing O_DIRECT read (from disk), so it needs to be user -> kernel, then. > > any chance of using O_DIRECT to the socket? Hmm, I'm still not clear on why you cannot use sendfile()? I was not aware of any upper limit to the file size in order for sendfile() to be used? From what little I know, this is exactly the kind of situation that sendfile was intended to benefit. thanks, Nivedita ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: O_DIRECT sockets? (was [RESEND] tuning linux for high network performance?) 2002-10-23 16:34 ` Nivedita Singhvi @ 2002-10-24 10:14 ` Roy Sigurd Karlsbakk 2002-10-24 10:46 ` David S. Miller 0 siblings, 1 reply; 36+ messages in thread From: Roy Sigurd Karlsbakk @ 2002-10-24 10:14 UTC (permalink / raw) To: Nivedita Singhvi; +Cc: bert hubert, netdev, Kernel mailing list > Hmm, I'm still not clear on why you cannot use sendfile()? > I was not aware of any upper limit to the file size in order > for sendfile() to be used? From what little I know, this > is exactly the kind of situation that sendfile was intended > to benefit. I can't use sendfile(). I'm working with files > 4GB, and from man 2 sendfile: ssize_t sendfile(int out_fd, int in_fd, off_t *offset, size_t count); int main() { ssize_t s1; off_t offset; size_t count; printf("sizeof ssize_t: %d\n", sizeof s1); printf("sizeof size_t: %d\n", sizeof count); printf("sizeof off_t: %d\n", sizeof offset); return 0; } running it $ ./sendfile_test sizeof ssize_t: 4 sizeof size_t: 4 sizeof off_t: 4 $ as far as I'm concerned, this will not allow me to address files past the 4GB limit (or was it 2?) roy -- Roy Sigurd Karlsbakk, Datavaktmester ProntoTV AS - http://www.pronto.tv/ Tel: +47 9801 3356 Computers are like air conditioners. They stop working when you open Windows. ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: O_DIRECT sockets? (was [RESEND] tuning linux for high network performance?) 2002-10-24 10:14 ` Roy Sigurd Karlsbakk @ 2002-10-24 10:46 ` David S. Miller 0 siblings, 0 replies; 36+ messages in thread From: David S. Miller @ 2002-10-24 10:46 UTC (permalink / raw) To: Roy Sigurd Karlsbakk Cc: Nivedita Singhvi, bert hubert, netdev, Kernel mailing list On Thu, 2002-10-24 at 03:14, Roy Sigurd Karlsbakk wrote: > I can't use sendfile(). I'm working with files > 4GB, and from man 2 sendfile: That's what sendfile64() is for. In fact every vendor I am aware of is shipping the sys_sendfile64() patch in their kernels and an appropriately fixed up glibc. ^ permalink raw reply [flat|nested] 36+ messages in thread
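Assuming a glibc and kernel that carry the sendfile64() support David refers to, the prototype glibc exposes is ssize_t sendfile64(int out_fd, int in_fd, off64_t *offset, size_t count), and the large-file loop is a small variation on the earlier sendfile() sketch (again a sketch, not tested against any particular 2.4 vendor tree):

    /* Large-file variant using sendfile64() and a 64-bit offset (sketch;
     * assumes sendfile64() is available, per David's note on vendor kernels). */
    #define _LARGEFILE64_SOURCE
    #include <sys/types.h>
    #include <sys/sendfile.h>

    static int stream_file64(int sock_fd, int file_fd, off64_t file_len)
    {
        off64_t offset = 0;
        size_t chunk;
        ssize_t n;

        while (offset < file_len) {
            chunk = 1 << 20;                       /* count is still a size_t */
            if ((off64_t)chunk > file_len - offset)
                chunk = (size_t)(file_len - offset);
            n = sendfile64(sock_fd, file_fd, &offset, chunk);
            if (n <= 0)
                return -1;
        }
        return 0;
    }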
* Re: [RESEND] tuning linux for high network performance? 2002-10-23 11:06 ` [RESEND] " Roy Sigurd Karlsbakk 2002-10-23 13:01 ` bert hubert @ 2002-10-23 18:01 ` Denis Vlasenko 2002-10-23 13:36 ` Roy Sigurd Karlsbakk 2002-10-23 14:52 ` [RESEND] tuning linux for high network performance? Nivedita Singhvi 1 sibling, 2 replies; 36+ messages in thread From: Denis Vlasenko @ 2002-10-23 18:01 UTC (permalink / raw) To: Roy Sigurd Karlsbakk, netdev On 23 October 2002 09:06, Roy Sigurd Karlsbakk wrote: > > I've got this video server serving video for VoD. problem is the P4 > > 1.8 seems to be maxed out by a few system calls. The below output > > is for ~50 clients streaming at ~4.5Mbps. if trying to increase > > this to ~70, the CPU maxes out. > > > > Does anyone have an idea? > > ...adding the whole profile output - sorted by the first column this > time... > > 905182 total 0.4741 > 121426 csum_partial_copy_generic 474.3203 Well, maybe take a look at this func and try to optimize it? > 93633 default_idle 1800.6346 > 74665 do_wp_page 111.1086 What's this? > 65857 ide_intr 184.9916 You have 1 ide_intr per 2 csum_partial_copy_generic... hmmm... how large is your readahead? I assume you'd like to fetch more sectors from ide per interrupt. (I hope you do DMA ;) > 53636 handle_IRQ_event 432.5484 > 21973 do_softirq 107.7108 > 20498 e1000_intr 244.0238 I know zero about networking, but why 120 000 csum_partial_copy_generic and only 20 000 nic interrupts? That may be abnormal. -- vda ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RESEND] tuning linux for high network performance? 2002-10-23 18:01 ` [RESEND] tuning linux for high network performance? Denis Vlasenko @ 2002-10-23 13:36 ` Roy Sigurd Karlsbakk 2002-10-24 16:22 ` Denis Vlasenko 2002-10-23 14:52 ` [RESEND] tuning linux for high network performance? Nivedita Singhvi 1 sibling, 1 reply; 36+ messages in thread From: Roy Sigurd Karlsbakk @ 2002-10-23 13:36 UTC (permalink / raw) To: vda, netdev; +Cc: Kernel mailing list > > > > 905182 total 0.4741 > > 121426 csum_partial_copy_generic 474.3203 > > Well, maybe take a look at this func and try to optimize it? I don't know assembly that good - sorry. > > 93633 default_idle 1800.6346 > > 74665 do_wp_page 111.1086 > > What's this? do_wp_page is Defined as a function in: mm/memory.c comments from the file: /* * This routine handles present pages, when users try to write * to a shared page. It is done by copying the page to a new address * and decrementing the shared-page counter for the old page. * * Goto-purists beware: the only reason for goto's here is that it results * in better assembly code.. The "default" path will see no jumps at all. * * Note that this routine assumes that the protection checks have been * done by the caller (the low-level page fault routine in most cases). * Thus we can safely just mark it writable once we've done any necessary * COW. * * We also mark the page dirty at this point even though the page will * change only once the write actually happens. This avoids a few races, * and potentially makes it more efficient. * * We hold the mm semaphore and the page_table_lock on entry and exit * with the page_table_lock released. */ > > > 65857 ide_intr 184.9916 > > You have 1 ide_intr per 2 csum_partial_copy_generic... hmmm... > how large is your readahead? I assume you'd like to fetch > more sectors from ide per interrupt. (I hope you do DMA ;) doing DMA - RAID-0 with 1MB chunk size on 4 disks. > > 53636 handle_IRQ_event 432.5484 > > 21973 do_softirq 107.7108 > > 20498 e1000_intr 244.0238 > > I know zero about networking, but why 120 000 csum_partial_copy_generic > and inly 20 000 nic interrupts? That may be abnormal. sorry I don't know -- Roy Sigurd Karlsbakk, Datavaktmester ProntoTV AS - http://www.pronto.tv/ Tel: +47 9801 3356 Computers are like air conditioners. They stop working when you open Windows. ^ permalink raw reply [flat|nested] 36+ messages in thread
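One way to answer Denis's readahead question: the per-device value (in 512-byte sectors) can be read and set with the old BLKRAGET/BLKRASET ioctls, which is what blockdev --getra/--setra does. A sketch (the header choice and device name are illustrative; the md array and the underlying disks can be queried separately):

    /* Print the readahead setting of a block device (e.g. /dev/md0 or one
     * of the underlying /dev/hd* disks), in 512-byte sectors (sketch). */
    #include <sys/ioctl.h>
    #include <sys/mount.h>     /* BLKRAGET, BLKRASET */
    #include <fcntl.h>
    #include <unistd.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        long ra = 0;
        int fd;

        if (argc < 2)
            return 1;
        fd = open(argv[1], O_RDONLY);
        if (fd < 0 || ioctl(fd, BLKRAGET, &ra) < 0) {
            perror(argv[1]);
            return 1;
        }
        printf("%s: readahead = %ld sectors (%ld KB)\n", argv[1], ra, ra / 2);
        /* raising it (root only): ioctl(fd, BLKRASET, (unsigned long)new_sectors); */
        return 0;
    }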
* Re: [RESEND] tuning linux for high network performance? 2002-10-23 13:36 ` Roy Sigurd Karlsbakk @ 2002-10-24 16:22 ` Denis Vlasenko 2002-10-24 11:50 ` Russell King 0 siblings, 1 reply; 36+ messages in thread From: Denis Vlasenko @ 2002-10-24 16:22 UTC (permalink / raw) To: Roy Sigurd Karlsbakk, netdev; +Cc: Kernel mailing list On 23 October 2002 11:36, Roy Sigurd Karlsbakk wrote: > > > 905182 total 0.4741 > > > 121426 csum_partial_copy_generic 474.3203 > > > > Well, maybe take a look at this func and try to optimize it? > > I don't know assembly that good - sorry. Well, I like it. Maybe I can look into it. Feel free to bug me :-) > > > 93633 default_idle 1800.6346 > > > 74665 do_wp_page 111.1086 > > > > What's this? > > do_wp_page is Defined as a function in: mm/memory.c > > comments from the file: > [snip] Please delete memory.o, rerun make bzImage, capture gcc command used for compiling memory.c, modify it: gcc ... -o memory.o -> gcc ... -S -o memory.s ... and examine assembler code. Maybe something will stick out (or use objdump to disassemble memory.o, I recall nice option to produce assembler output with C code intermixed as comments!) (send disasmed listing to me offlist). > > > 65857 ide_intr 184.9916 > > > > You have 1 ide_intr per 2 csum_partial_copy_generic... hmmm... > > how large is your readahead? I assume you'd like to fetch > > more sectors from ide per interrupt. (I hope you do DMA ;) > > doing DMA - RAID-0 with 1MB chunk size on 4 disks. You should aim at maxing out IDE performance. Please find out how many sectors you read in one go. Maybe: # cat /proc/interrupts # dd bs=1m count=1 if=/dev/hda of=/dev/null # cat /proc/interrupts and calculate how many IDE interrupts happened. (1mb = 2048 sectors) -- vda ^ permalink raw reply [flat|nested] 36+ messages in thread
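A rough expectation for that check, assuming one interrupt per completed DMA request and roughly 128 KB (256 sectors) per request - both assumptions, since the real per-request limit depends on the chipset and driver: 1 MB = 2048 sectors, so the dd above should add something on the order of 8 interrupts to the IDE channel's count in /proc/interrupts. Seeing many times that number suggests the per-request transfers are much smaller than they could be.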
* Re: [RESEND] tuning linux for high network performance? 2002-10-24 16:22 ` Denis Vlasenko @ 2002-10-24 11:50 ` Russell King 2002-10-24 12:42 ` bert hubert 2002-10-24 17:41 ` Denis Vlasenko 0 siblings, 2 replies; 36+ messages in thread From: Russell King @ 2002-10-24 11:50 UTC (permalink / raw) To: Denis Vlasenko; +Cc: Roy Sigurd Karlsbakk, netdev, Kernel mailing list On Thu, Oct 24, 2002 at 02:22:25PM -0200, Denis Vlasenko wrote: > Please delete memory.o, rerun make bzImage, capture gcc > command used for compiling memory.c, modify it: > > gcc ... -o memory.o -> gcc ... -S -o memory.s ... Have you tried make mm/memory.s ? -- Russell King (rmk@arm.linux.org.uk) The developer of ARM Linux http://www.arm.linux.org.uk/personal/aboutme.html ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RESEND] tuning linux for high network performance? 2002-10-24 11:50 ` Russell King @ 2002-10-24 12:42 ` bert hubert 2002-10-24 17:41 ` Denis Vlasenko 1 sibling, 0 replies; 36+ messages in thread From: bert hubert @ 2002-10-24 12:42 UTC (permalink / raw) To: Denis Vlasenko, Roy Sigurd Karlsbakk, netdev On Thu, Oct 24, 2002 at 12:50:31PM +0100, Russell King wrote: > > gcc ... -o memory.o -> gcc ... -S -o memory.s ... > > Have you tried make mm/memory.s ? or even make mm/memory.lst -- http://www.PowerDNS.com Versatile DNS Software & Services http://lartc.org Linux Advanced Routing & Traffic Control HOWTO ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RESEND] tuning linux for high network performance? 2002-10-24 11:50 ` Russell King 2002-10-24 12:42 ` bert hubert @ 2002-10-24 17:41 ` Denis Vlasenko 2002-10-25 11:36 ` Csum and csum copyroutines benchmark Denis Vlasenko 1 sibling, 1 reply; 36+ messages in thread From: Denis Vlasenko @ 2002-10-24 17:41 UTC (permalink / raw) To: Russell King; +Cc: Roy Sigurd Karlsbakk, netdev, Kernel mailing list On 24 October 2002 09:50, Russell King wrote: > On Thu, Oct 24, 2002 at 02:22:25PM -0200, Denis Vlasenko wrote: > > Please delete memory.o, rerun make bzImage, capture gcc > > command used for compiling memory.c, modify it: > > > > gcc ... -o memory.o -> gcc ... -S -o memory.s ... > > Have you tried make mm/memory.s ? No ;) but I have a feeling it will produce that file ;))) I'm experimenting with different csum_ routines in userspace now. -- vda ^ permalink raw reply [flat|nested] 36+ messages in thread
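For anyone wanting to reproduce this kind of user-space measurement, a minimal TSC-based harness in the spirit of the "cycles per kb" numbers reported in the next message (a sketch only; inet_csum is a stand-in for whichever csum_ routine is under test, and the benchmark attached below is the authoritative version):

    /* Minimal user-space cycle counter for timing a checksum routine on x86. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <stdint.h>

    static inline uint64_t rdtsc(void)
    {
        uint32_t lo, hi;
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    }

    extern uint16_t inet_csum(const void *data, size_t len);   /* routine under test */

    int main(void)
    {
        const size_t len = 4 << 20;               /* 4 MB buffer, as in the runs below */
        unsigned char *buf = malloc(len);
        uint64_t t0, t1;
        uint16_t sum;

        if (!buf)
            return 1;
        memset(buf, 0xaa, len);                   /* touch the pages before timing */
        t0 = rdtsc();
        sum = inet_csum(buf, len);
        t1 = rdtsc();
        printf("%llu cycles/KB (sum=0x%04x)\n",
               (unsigned long long)((t1 - t0) / (len / 1024)), sum);
        return 0;
    }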
* Csum and csum copyroutines benchmark 2002-10-24 17:41 ` Denis Vlasenko @ 2002-10-25 11:36 ` Denis Vlasenko 2002-10-25 7:48 ` Momchil Velikov 0 siblings, 1 reply; 36+ messages in thread
From: Denis Vlasenko @ 2002-10-25 11:36 UTC (permalink / raw)
To: Russell King, Roy Sigurd Karlsbakk, netdev, Kernel mailing list
Cc: libc-alpha

[-- Attachment #1: Type: text/plain, Size: 5579 bytes --]

/me said:
> I'm experimenting with different csum_ routines in userspace now.

Short conclusion:
1. It is possible to speed up csum routines for AMD processors by 30%.
2. It is possible to speed up csum_copy routines for both AMD and Intel
   three times or more. Roy, do you like that? ;)

Tests: they checksum a 4 MB block and csum_copy 2 MB into the second 2 MB.
POISON=0/1 controls whether to perform correctness tests or not; that slows
the test down very noticeably. What does glibc use for memset/memcmp? A
for() loop?!!

With POISON=1, ntqpf2_copy bugs out - see its source. I left it in so that
others don't have to repeat my work. BTW, I do NOT understand why it does
not work. ;) Anyone with a cluebat?

IMHO the only way to make it optimal for all CPUs is to make these functions
race at kernel init and pick the best one.

Tests on Celeron 1200 (100 MHz, x12 core)
=========================================
Csum benchmark program
buffer size: 4 Mb
Each test tried 16 times, max and min CPU cycles are reported.
Please disregard max values. They are due to system interference only.
csum tests:
kernel_csum - took 717 max, 704 min cycles per kb. sum=0x44000077
kernel_csum - took 4760 max, 704 min cycles per kb. sum=0x44000077
kernel_csum - took 722 max, 704 min cycles per kb. sum=0x44000077
kernelpii_csum - took 539 max, 528 min cycles per kb. sum=0x44000077
kernelpiipf_csum - took 573 max, 529 min cycles per kb. sum=0x44000077
pfm_csum - took 1411 max, 1306 min cycles per kb. sum=0x44000077
pfm2_csum - took 875 max, 762 min cycles per kb. sum=0x44000077
copy tests:
kernel_copy - took 5738 max, 3423 min cycles per kb. sum=0x99aaaacc
kernel_copy - took 3517 max, 3431 min cycles per kb. sum=0x99aaaacc
kernel_copy - took 4385 max, 3432 min cycles per kb. sum=0x99aaaacc
kernelpii_copy - took 2912 max, 2752 min cycles per kb. sum=0x99aaaacc
ntqpf_copy - took 2010 max, 1700 min cycles per kb. sum=0x99aaaacc
ntqpfm_copy - took 1749 max, 1701 min cycles per kb. sum=0x99aaaacc
ntq_copy - took 2218 max, 2141 min cycles per kb. sum=0x99aaaacc
BAD copy!   <-- ntqpf2_copy is buggy :) see its source

'copy tests' above are with POISON=1. These are with POISON=0:
kernel_copy - took 2009 max, 1935 min cycles per kb. sum=0x44000077
kernel_copy - took 2240 max, 1959 min cycles per kb. sum=0x44000077
kernel_copy - took 2197 max, 1936 min cycles per kb. sum=0x44000077
kernelpii_copy - took 2121 max, 1939 min cycles per kb. sum=0x44000077
ntqpf_copy - took 667 max, 548 min cycles per kb. sum=0x44000077
ntqpfm_copy - took 651 max, 546 min cycles per kb. sum=0x44000077
ntq_copy - took 660 max, 545 min cycles per kb. sum=0x44000077
ntqpf2_copy - took 644 max, 548 min cycles per kb. sum=0x44000077
Done

Tests on Duron 650 (100 MHz, x6.5 core)
=======================================
Csum benchmark program
buffer size: 4 Mb
Each test tried 16 times, max and min CPU cycles are reported.
Please disregard max values. They are due to system interference only.
csum tests:
kernel_csum - took 1090 max, 1051 min cycles per kb. sum=0x44000077
kernel_csum - took 1080 max, 1052 min cycles per kb. sum=0x44000077
kernel_csum - took 1178 max, 1058 min cycles per kb. sum=0x44000077
kernelpii_csum - took 1614 max, 1052 min cycles per kb. sum=0x44000077
kernelpiipf_csum - took 976 max, 962 min cycles per kb. sum=0x44000077
pfm_csum - took 755 max, 746 min cycles per kb. sum=0x44000077
pfm2_csum - took 749 max, 745 min cycles per kb. sum=0x44000077
copy tests:
kernel_copy - took 1251 max, 1072 min cycles per kb. sum=0x99aaaacc
kernel_copy - took 1363 max, 1072 min cycles per kb. sum=0x99aaaacc
kernel_copy - took 1352 max, 1072 min cycles per kb. sum=0x99aaaacc
kernelpii_copy - took 1132 max, 1014 min cycles per kb. sum=0x99aaaacc
ntqpf_copy - took 514 max, 480 min cycles per kb. sum=0x99aaaacc
ntqpfm_copy - took 495 max, 482 min cycles per kb. sum=0x99aaaacc
ntq_copy - took 1153 max, 948 min cycles per kb. sum=0x99aaaacc
BAD copy!   <-- ntqpf2_copy is buggy :) see its source

'copy tests' above are with POISON=1. These are with POISON=0:
kernel_copy - took 1145 max, 871 min cycles per kb. sum=0x44000077
kernel_copy - took 879 max, 871 min cycles per kb. sum=0x44000077
kernel_copy - took 876 max, 871 min cycles per kb. sum=0x44000077
kernelpii_copy - took 1019 max, 845 min cycles per kb. sum=0x44000077
ntqpf_copy - took 2972 max, 229 min cycles per kb. sum=0x44000077
ntqpfm_copy - took 248 max, 245 min cycles per kb. sum=0x44000077
ntq_copy - took 460 max, 452 min cycles per kb. sum=0x44000077
ntqpf2_copy - took 390 max, 340 min cycles per kb. sum=0x44000077
Done
--
vda

[-- Attachment #2: timing_csum_copy.tar.bz2 --]
[-- Type: application/x-bzip2, Size: 6589 bytes --]

^ permalink raw reply [flat|nested] 36+ messages in thread
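The "race at kernel init and pick the best one" idea above can be sketched in plain C: time each candidate once, then route every later call through a function pointer. The csum prototype and routine names follow the benchmark, but the selection harness itself (calibrate_csum(), best_csum and the candidate table) is a hypothetical illustration, not part of the posted code:

/* Sketch only: pick the fastest csum routine at startup.
 * kernel_csum/kernelpii_csum/pfm_csum are the benchmark's external
 * routines; calibrate_csum() and best_csum are made-up names. */
typedef unsigned int csum_func(const unsigned char *buff, int len,
                               unsigned int sum);

extern csum_func kernel_csum, kernelpii_csum, pfm_csum;

static csum_func *best_csum = kernel_csum;      /* safe default */

static inline unsigned long long rdtsc(void)
{
        unsigned int lo, hi;
        __asm__ __volatile__("rdtsc" : "=a" (lo), "=d" (hi));
        return lo | ((unsigned long long)hi << 32);
}

void calibrate_csum(const unsigned char *buf, int len)
{
        csum_func *cand[] = { kernel_csum, kernelpii_csum, pfm_csum };
        unsigned long long best = ~0ULL;
        unsigned int i, r;

        for (i = 0; i < sizeof(cand) / sizeof(cand[0]); i++) {
                unsigned long long min = ~0ULL, t0, t1;

                /* best-of-8 to filter out interrupts, as in the benchmark */
                for (r = 0; r < 8; r++) {
                        t0 = rdtsc();
                        cand[i](buf, len, 0);
                        t1 = rdtsc();
                        if (t1 - t0 < min)
                                min = t1 - t0;
                }
                if (min < best) {
                        best = min;
                        best_csum = cand[i];
                }
        }
}

A kernel could run the equivalent once at boot and never touch best_csum again; the same trick would work for the copy routines.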
* Re: Csum and csum copyroutines benchmark 2002-10-25 11:36 ` Csum and csum copyroutines benchmark Denis Vlasenko @ 2002-10-25 7:48 ` Momchil Velikov 2002-10-25 13:59 ` Denis Vlasenko 0 siblings, 1 reply; 36+ messages in thread
From: Momchil Velikov @ 2002-10-25 7:48 UTC (permalink / raw)
To: vda
Cc: Russell King, Roy Sigurd Karlsbakk, netdev, Kernel mailing list, libc-alpha

>>>>> "Denis" == Denis Vlasenko <vda@port.imtp.ilyichevsk.odessa.ua> writes:

Denis> /me said:
>> I'm experimenting with different csum_ routines in userspace now.

Denis> Short conclusion:
Denis> 1. It is possible to speed up csum routines for AMD processors by 30%.
Denis> 2. It is possible to speed up csum_copy routines for both AMD and Intel
Denis> three times or more. Roy, do you like that? ;)

Additional data point:

Short summary:
1. Checksum - kernelpii_csum is ~19% faster
2. Copy - kernelpii_copy is ~6% faster

Dual Pentium III, 1266 MHz, 512K cache, 2G SDRAM (133 MHz, ECC)

The only changes I made were to decrease the buffer size to 1K (as I think
this is more representative of a network packet size, correct me if I'm
wrong) and increase the runs to 1024. Max values are worthless indeed.

Csum benchmark program
buffer size: 1 K
Each test tried 1024 times, max and min CPU cycles are reported.
Please disregard max values. They are due to system interference only.
csum tests:
kernel_csum - took 941 max, 740 min cycles per kb. sum=0x44000077
kernel_csum - took 748 max, 742 min cycles per kb. sum=0x44000077
kernel_csum - took 60559 max, 742 min cycles per kb. sum=0x44000077
kernelpii_csum - took 52804 max, 601 min cycles per kb. sum=0x44000077
kernelpiipf_csum - took 12930 max, 601 min cycles per kb. sum=0x44000077
pfm_csum - took 10161 max, 1402 min cycles per kb. sum=0x44000077
pfm2_csum - took 864 max, 838 min cycles per kb. sum=0x44000077
copy tests:
kernel_copy - took 339 max, 239 min cycles per kb. sum=0x44000077
kernel_copy - took 239 max, 239 min cycles per kb. sum=0x44000077
kernel_copy - took 239 max, 239 min cycles per kb. sum=0x44000077
kernelpii_copy - took 244 max, 225 min cycles per kb. sum=0x44000077
ntqpf_copy - took 10867 max, 512 min cycles per kb. sum=0x44000077
ntqpfm_copy - took 710 max, 403 min cycles per kb. sum=0x44000077
ntq_copy - took 4535 max, 443 min cycles per kb. sum=0x44000077
ntqpf2_copy - took 563 max, 555 min cycles per kb. sum=0x44000077
Done

HOWEVER ... sometimes (say 1/30) I get the following output:

Csum benchmark program
buffer size: 1 K
Each test tried 1024 times, max and min CPU cycles are reported.
Please disregard max values. They are due to system interference only.
csum tests:
kernel_csum - took 958 max, 740 min cycles per kb. sum=0x44000077
kernel_csum - took 748 max, 740 min cycles per kb. sum=0x44000077
kernel_csum - took 752 max, 740 min cycles per kb. sum=0x44000077
kernelpii_csum - took 624 max, 600 min cycles per kb. sum=0x44000077
kernelpiipf_csum - took 877211 max, 601 min cycles per kb. sum=0x44000077
Bad sum
Aborted

which is to say that pfm_csum and pfm2_csum results are not to be trusted
(at least on PIII, or with my kernel CONFIG_MPENTIUMIII=y config?).

~velco

^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Csum and csum copyroutines benchmark 2002-10-25 7:48 ` Momchil Velikov @ 2002-10-25 13:59 ` Denis Vlasenko 2002-10-25 9:47 ` Momchil Velikov 2002-10-25 10:19 ` Alan Cox 0 siblings, 2 replies; 36+ messages in thread
From: Denis Vlasenko @ 2002-10-25 13:59 UTC (permalink / raw)
To: Momchil Velikov
Cc: Russell King, Roy Sigurd Karlsbakk, netdev, Kernel mailing list

[please drop libc from CC:]

On 25 October 2002 05:48, Momchil Velikov wrote:
>> Short conclusion:
>> 1. It is possible to speed up csum routines for AMD processors
>> by 30%.
>> 2. It is possible to speed up csum_copy routines for both AMD
>> and Intel three times or more.
> Additional data point:
>
> Short summary:
> 1. Checksum - kernelpii_csum is ~19% faster
> 2. Copy - kernelpii_copy is ~6% faster
>
> Dual Pentium III, 1266 MHz, 512K cache, 2G SDRAM (133 MHz, ECC)
>
> The only changes I made were to decrease the buffer size to 1K (as I
> think this is more representative of a network packet size, correct
> me if I'm wrong) and increase the runs to 1024. Max values are
> worthless indeed.

Well, that makes it run entirely in L0 cache. This is unrealistic for
actual use. movntq is x3 faster when you hit RAM instead of L0.

You need to be more clever than that - generate pseudo-random offsets in
a large buffer and run on ~1K pieces of that buffer.

> HOWEVER ...
>
> sometimes (say 1/30) I get the following output:

Csum benchmark program
buffer size: 1 K
Each test tried 1024 times, max and min CPU cycles are reported.
Please disregard max values. They are due to system interference only.
csum tests:
kernel_csum - took 958 max, 740 min cycles per kb. sum=0x44000077
kernel_csum - took 748 max, 740 min cycles per kb. sum=0x44000077
kernel_csum - took 752 max, 740 min cycles per kb. sum=0x44000077
kernelpii_csum - took 624 max, 600 min cycles per kb. sum=0x44000077
kernelpiipf_csum - took 877211 max, 601 min cycles per kb. sum=0x44000077
Bad sum
Aborted

> which is to say that pfm_csum and pfm2_csum results are not to be
> trusted (at least on PIII (or my kernel CONFIG_MPENTIUMIII=y
> config?)).

No, it's my fault. Those routines were hacked together quickly; they can
actually csum too little. I didn't get around to handling arbitrary buffer
lengths and assume the length is a large power of two. See the source.
--
vda

^ permalink raw reply [flat|nested] 36+ messages in thread
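The access pattern Denis is asking for - pseudo-random ~1K pieces of a buffer much larger than the caches - looks roughly like the sketch below. The constants and the helper name are illustrative only; Momchil's modified 0main.c in the next message does essentially the same thing inside its timing loop:

/* Sketch: checksum 1K pieces at pseudo-random offsets inside a buffer
 * far larger than L1/L2, so most accesses miss the cache.
 * WORKING_SET, PIECE and csum_random_pieces() are illustrative names;
 * buf must point to at least WORKING_SET bytes. */
#include <stdlib.h>

#define WORKING_SET (4 * 1024 * 1024)   /* well above any 2002-era cache */
#define PIECE       1024                /* roughly one network packet    */

typedef unsigned int csum_func(const unsigned char *buff, int len,
                               unsigned int sum);

unsigned int csum_random_pieces(csum_func *f, const unsigned char *buf,
                                int iterations)
{
        int i, npieces = WORKING_SET / PIECE;
        unsigned int sum = 0;

        for (i = 0; i < iterations; i++) {
                int piece = rand() % npieces;   /* pseudo-random offset */
                sum = f(buf + piece * PIECE, PIECE, sum);
        }
        return sum;
}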
* Re: Csum and csum copyroutines benchmark 2002-10-25 13:59 ` Denis Vlasenko @ 2002-10-25 9:47 ` Momchil Velikov 0 siblings, 0 replies; 36+ messages in thread
From: Momchil Velikov @ 2002-10-25 9:47 UTC (permalink / raw)
To: vda; +Cc: Russell King, Roy Sigurd Karlsbakk, netdev, Kernel mailing list

[-- Attachment #1: Type: text/plain, Size: 2863 bytes --]

>>>>> "Denis" == Denis Vlasenko <vda@port.imtp.ilyichevsk.odessa.ua> writes:

Denis> [please drop libc from CC:]

Denis> On 25 October 2002 05:48, Momchil Velikov wrote:
>>> Short conclusion:
>>> 1. It is possible to speed up csum routines for AMD processors
>>> by 30%.
>>> 2. It is possible to speed up csum_copy routines for both AMD
>>> and Intel three times or more.
>> Additional data point:
>>
>> Short summary:
>> 1. Checksum - kernelpii_csum is ~19% faster
>> 2. Copy - kernelpii_copy is ~6% faster
>>
>> Dual Pentium III, 1266 MHz, 512K cache, 2G SDRAM (133 MHz, ECC)
>>
>> The only changes I made were to decrease the buffer size to 1K (as I
>> think this is more representative of a network packet size, correct
>> me if I'm wrong) and increase the runs to 1024. Max values are
>> worthless indeed.

Denis> Well, that makes it run entirely in L0 cache. This is unrealistic
Denis> for actual use. movntq is x3 faster when you hit RAM instead of L0.

Oops ...

Denis> You need to be more clever than that - generate pseudo-random
Denis> offsets in a large buffer and run on ~1K pieces of that buffer.

Here it is:

Csum benchmark program
buffer size: 1 K
Each test tried 1024 times, max and min CPU cycles are reported.
Please disregard max values. They are due to system interference only.
csum tests:
kernel_csum - took 8678 max, 808 min cycles per kb. sum=0x400270e8
kernel_csum - took 941 max, 808 min cycles per kb. sum=0x400270e8
kernel_csum - took 11604 max, 808 min cycles per kb. sum=0x400270e8
kernelpii_csum - took 28839 max, 664 min cycles per kb. sum=0x400270e8
kernelpiipf_csum - took 9163 max, 665 min cycles per kb. sum=0x400270e8
pfm_csum - took 2788 max, 1470 min cycles per kb. sum=0x400270e8
pfm2_csum - took 1179 max, 915 min cycles per kb. sum=0x400270e8
copy tests:
kernel_copy - took 688 max, 263 min cycles per kb. sum=0x400270e8
kernel_copy - took 456 max, 263 min cycles per kb. sum=0x400270e8
kernel_copy - took 11241 max, 263 min cycles per kb. sum=0x400270e8
kernelpii_copy - took 7635 max, 246 min cycles per kb. sum=0x400270e8
ntqpf_copy - took 5349 max, 536 min cycles per kb. sum=0x400270e8
ntqpfm_copy - took 769 max, 425 min cycles per kb. sum=0x400270e8
ntq_copy - took 672 max, 469 min cycles per kb. sum=0x400270e8
ntqpf2_copy - took 8000 max, 579 min cycles per kb. sum=0x400270e8
Done

Ran on a 512K buffer (my cache size), choosing a 1K piece each time.
(Making the buffer larger (2M, 4M) does not make any difference.)

The modified 0main.c is attached.

~velco

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0main.c --]
[-- Type: text/x-csrc, Size: 3996 bytes --]

#include <stdio.h>
#include <stdlib.h>
#include <string.h>     /* memset/memcmp for the POISON checks */

#define NAME(a) \
        unsigned int a##csum(const unsigned char *buff, int len, \
                        unsigned int sum); \
        unsigned int a##copy(const char *src, char *dst, \
                        int len, int sum, int *src_err_ptr, int *dst_err_ptr)
/* This makes adding/removing test functions easier */

/* asm ones... */
NAME(kernel_);
NAME(kernelpii_);
NAME(kernelpiipf_);

/* and C */
#include "pfm_csum.c"
#include "pfm2_csum.c"
#include "ntq_copy.c"
#include "ntqpf_copy.c"
#include "ntqpf2_copy.c"
#include "ntqpfm_copy.c"

const int TRY_TIMES = 1024;
const int NBUFS = 512;
const int BUFSIZE = 1024;
const int POISON = 0;   /* want to check correctness? */

typedef unsigned int csum_func(const unsigned char *buff, int len,
                unsigned int sum);
typedef unsigned int copy_func(const char *src, char *dst, int len, int sum,
                int *src_err_ptr, int *dst_err_ptr);

static inline long long rdtsc()
{
        unsigned int low, high;

        __asm__ __volatile__("rdtsc" : "=a" (low), "=d" (high));
        return low + (((long long)high) << 32);
}

int die(const char *msg)
{
        puts(msg);
        abort();
        return 1;
}

unsigned test_one_csum(csum_func *func, char *name, char *buffer)
{
        int i;
        unsigned long long before, after, min, max;
        unsigned sum = 0;

        /* pick fastest run; note the rand() call is inside the timed region */
        min = ~0ULL;
        max = 0;
        for (i = 0; i < TRY_TIMES; i++) {
                before = rdtsc();
                sum = func(buffer + (rand() % NBUFS) * BUFSIZE, BUFSIZE, 0);
                after = rdtsc();
                if (before > after)
                        die("timer overflow");
                after -= before;
                if (min > after) min = after;
                if (max < after) max = after;
        }
        printf("%32s - took %5lli max,%5lli min cycles per kb. sum=0x%08x\n",
                name, max / (BUFSIZE/1024), min / (BUFSIZE/1024), sum);
        return sum;
}

unsigned test_one_copy(copy_func *func, char *name, char *buffer)
{
        int i;
        unsigned long long before, after, min, max;
        unsigned sum = 0;
        int err;

        /* pick fastest run */
        min = ~0ULL;
        max = 0;
        for (i = 0; i < TRY_TIMES; i++) {
                char *buf;

                /* the POISON check poisons/compares the first BUFSIZE bytes only */
                if (POISON) memset(buffer, 0x55, BUFSIZE/2);
                if (POISON) memset(buffer + BUFSIZE/2, 0xaa, BUFSIZE/2);
                buffer[0] = 0x77;
                buffer[BUFSIZE/2 - 1] = 0x44;

                before = rdtsc();
                buf = buffer + rand() % (NBUFS - 1);    /* byte offset 0..NBUFS-2 */
                sum = func(buf, buf + BUFSIZE/2, BUFSIZE/2, 0, &err, &err);
                after = rdtsc();

                if (POISON && memcmp(buffer, buffer + BUFSIZE/2, BUFSIZE/2) != 0)
                        die("BAD copy!");
                if (before > after)
                        die("timer overflow");
                after -= before;
                if (min > after) min = after;
                if (max < after) max = after;
        }
        printf("%32s - took %5lli max,%5lli min cycles per kb. sum=0x%08x\n",
                name, max / (BUFSIZE/1024) / 2, min / (BUFSIZE/1024) / 2, sum);
        return sum;
}

void test_csum(char *buffer)
{
        puts("csum tests:");
#define TEST_CSUM(a) test_one_csum(a, #a, buffer)
        TEST_CSUM(kernel_csum);
        TEST_CSUM(kernel_csum);
        TEST_CSUM(kernel_csum);
        TEST_CSUM(kernelpii_csum);
        TEST_CSUM(kernelpiipf_csum);
        TEST_CSUM(pfm_csum);
        TEST_CSUM(pfm2_csum);
#undef TEST_CSUM
}

void test_copy(char *buffer)
{
        unsigned sum;

        puts("copy tests:");
        /* every routine must return the same sum as the first kernel_copy run */
#define TEST_COPY(a) test_one_copy(a, #a, buffer)
        sum = TEST_COPY(kernel_copy);
        sum == TEST_COPY(kernel_copy)    || die("Bad sum");
        sum == TEST_COPY(kernel_copy)    || die("Bad sum");
        sum == TEST_COPY(kernelpii_copy) || die("Bad sum");
        sum == TEST_COPY(ntqpf_copy)     || die("Bad sum");
        sum == TEST_COPY(ntqpfm_copy)    || die("Bad sum");
        sum == TEST_COPY(ntq_copy)       || die("Bad sum");
        sum == TEST_COPY(ntqpf2_copy)    || die("Bad sum");
#undef TEST_COPY
}

int main()
{
        char *buffer_raw, *buffer;

        printf("Csum benchmark program\n"
                "buffer size: %i K\n"
                "Each test tried %i times, max and min CPU cycles are reported.\n"
                "Please disregard max values. They are due to system interference only.\n",
                BUFSIZE/1024, TRY_TIMES);

        buffer_raw = malloc(NBUFS * BUFSIZE + 16);
        if (!buffer_raw) die("Malloc failed");
        buffer = (char *)((((int)buffer_raw) + 15) & ~0xF);     /* 16-byte align */

        test_csum(buffer);
        test_copy(buffer);
        puts("Done");
        free(buffer_raw);
        return 0;
}

^ permalink raw reply [flat|nested] 36+ messages in thread
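Given the intermittent "Bad sum"/Aborted runs above and Denis's explanation that the pfm routines only handle lengths that are a large power of two, a simple guard is to cross-check each candidate against the reference kernel_csum on a few random offsets and lengths before timing it. This is only a sketch: check_csum_against_reference() is a hypothetical helper, not part of the posted benchmark, and the csum prototype follows the benchmark's declarations:

/* Sketch: refuse to benchmark a csum routine that disagrees with the
 * reference implementation on random (offset, length) pairs. */
#include <stdio.h>
#include <stdlib.h>

typedef unsigned int csum_func(const unsigned char *buff, int len,
                               unsigned int sum);

int check_csum_against_reference(csum_func *ref, csum_func *cand,
                                 const unsigned char *buf, int bufsize,
                                 const char *name)
{
        int i;

        for (i = 0; i < 64; i++) {
                /* odd lengths and offsets, not just powers of two */
                int len = 64 + rand() % 960;
                int off = rand() % (bufsize - len);
                unsigned int want = ref(buf + off, len, 0);
                unsigned int got  = cand(buf + off, len, 0);

                if (want != got) {
                        printf("%s: wrong sum at off=%d len=%d "
                               "(got 0x%08x, want 0x%08x) - skipping\n",
                               name, off, len, got, want);
                        return 0;
                }
        }
        return 1;       /* agreed on all samples; OK to time it */
}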
* Re: Csum and csum copyroutines benchmark 2002-10-25 13:59 ` Denis Vlasenko 2002-10-25 9:47 ` Momchil Velikov @ 2002-10-25 10:19 ` Alan Cox 2002-10-25 16:00 ` Denis Vlasenko 1 sibling, 1 reply; 36+ messages in thread
From: Alan Cox @ 2002-10-25 10:19 UTC (permalink / raw)
To: vda
Cc: Momchil Velikov, Russell King, Roy Sigurd Karlsbakk, netdev, Linux Kernel Mailing List

On Fri, 2002-10-25 at 14:59, Denis Vlasenko wrote:
> Well, that makes it run entirely in L0 cache. This is unrealistic
> for actual use. movntq is x3 faster when you hit RAM instead of L0.
>
> You need to be more clever than that - generate pseudo-random
> offsets in a large buffer and run on ~1K pieces of that buffer.

In a lot of cases it's extremely realistic to assume the network buffers
are in cache. The copy/csum path is often touching just-generated data,
or data we just accessed via read(). The csum RX path from a card with
DMA is probably somewhat different.

^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Csum and csum copyroutines benchmark 2002-10-25 10:19 ` Alan Cox @ 2002-10-25 16:00 ` Denis Vlasenko 0 siblings, 0 replies; 36+ messages in thread
From: Denis Vlasenko @ 2002-10-25 16:00 UTC (permalink / raw)
To: Alan Cox
Cc: Momchil Velikov, Russell King, Roy Sigurd Karlsbakk, netdev, Linux Kernel Mailing List

On 25 October 2002 08:19, Alan Cox wrote:
> On Fri, 2002-10-25 at 14:59, Denis Vlasenko wrote:
> > Well, that makes it run entirely in L0 cache. This is unrealistic
> > for actual use. movntq is x3 faster when you hit RAM instead of L0.
> >
> > You need to be more clever than that - generate pseudo-random
> > offsets in a large buffer and run on ~1K pieces of that buffer.
>
> In a lot of cases it's extremely realistic to assume the network
> buffers are in cache. The copy/csum path is often touching
> just-generated data, or data we just accessed via read(). The csum
> RX path from a card with DMA is probably somewhat different.

'Touching' is not interesting, since it will pump data into the cache no
matter how you 'touch' it. Running benchmarks against a 1K static buffer
makes the cache red hot and causes _all writes_ to hit it. That may lead
to wrong conclusions.

Is the _dst_ buffer of csum_copy going to be used by the processor soon?
If yes, we shouldn't use movntq; we want dst cached. If no, we should by
all means use movntq. If sometimes, then no single optimal strategy
exists. :(
--
vda

^ permalink raw reply [flat|nested] 36+ messages in thread
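For context, movntq is an MMX store with a non-temporal hint: the data is written out through the CPU's write-combining buffers rather than being pulled into the cache, which is exactly why it only wins when dst is not about to be read by the CPU. A minimal copy loop built around it might look like the sketch below. This is not the benchmark's ntq_copy; it assumes an MMX/SSE-capable CPU and a length that is a multiple of 64, and the function name is made up:

/* Sketch: copy 'len' bytes (a multiple of 64) with non-temporal MMX
 * stores. The data streams past the cache instead of displacing it,
 * so it only helps when 'dst' is not about to be read again. */
static void nt_copy64(const unsigned char *src, unsigned char *dst,
                      unsigned long len)
{
        unsigned long i;

        for (i = 0; i < len; i += 64) {
                __asm__ __volatile__(
                        "movq     (%0), %%mm0\n\t"
                        "movq    8(%0), %%mm1\n\t"
                        "movq   16(%0), %%mm2\n\t"
                        "movq   24(%0), %%mm3\n\t"
                        "movq   32(%0), %%mm4\n\t"
                        "movq   40(%0), %%mm5\n\t"
                        "movq   48(%0), %%mm6\n\t"
                        "movq   56(%0), %%mm7\n\t"
                        "movntq %%mm0,   (%1)\n\t"
                        "movntq %%mm1,  8(%1)\n\t"
                        "movntq %%mm2, 16(%1)\n\t"
                        "movntq %%mm3, 24(%1)\n\t"
                        "movntq %%mm4, 32(%1)\n\t"
                        "movntq %%mm5, 40(%1)\n\t"
                        "movntq %%mm6, 48(%1)\n\t"
                        "movntq %%mm7, 56(%1)\n\t"
                        : : "r" (src + i), "r" (dst + i) : "memory");
        }
        /* drain the non-temporal stores and leave MMX state */
        __asm__ __volatile__("sfence\n\temms" : : : "memory");
}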
* Re: [RESEND] tuning linux for high network performance? 2002-10-23 18:01 ` [RESEND] tuning linux for high network performance? Denis Vlasenko 2002-10-23 13:36 ` Roy Sigurd Karlsbakk @ 2002-10-23 14:52 ` Nivedita Singhvi 1 sibling, 0 replies; 36+ messages in thread
From: Nivedita Singhvi @ 2002-10-23 14:52 UTC (permalink / raw)
To: vda; +Cc: Roy Sigurd Karlsbakk, netdev

Denis Vlasenko wrote:
> I know zero about networking, but why 120 000 csum_partial_copy_generic
> and only 20 000 NIC interrupts? That may be abnormal.
> --
> vda

Because, firstly, we pick up several packets per interrupt, and
additionally, the function is also called on the send side.

thanks,
Nivedita

^ permalink raw reply [flat|nested] 36+ messages in thread
end of thread, newest: 2002-10-25 16:00 UTC

Thread overview: 36+ messages
2002-10-23 10:18 tuning linux for high network performance? Roy Sigurd Karlsbakk
2002-10-23 11:06 ` [RESEND] " Roy Sigurd Karlsbakk
2002-10-23 13:01 ` bert hubert
2002-10-23 13:21 ` David S. Miller
2002-10-23 13:42 ` Roy Sigurd Karlsbakk
2002-10-23 17:01 ` bert hubert
2002-10-23 17:10 ` Ben Greear
2002-10-23 17:11 ` Richard B. Johnson
2002-10-23 17:12 ` Nivedita Singhvi
2002-10-23 17:56 ` Richard B. Johnson
2002-10-23 18:07 ` Nivedita Singhvi
2002-10-23 18:30 ` Richard B. Johnson
2002-10-24 4:11 ` David S. Miller
2002-10-24 9:37 ` Karen Shaeffer
2002-10-24 10:30 ` sendfile64() anyone? (was [RESEND] tuning linux for high network performance?) Roy Sigurd Karlsbakk
2002-10-24 10:47 ` David S. Miller
2002-10-24 11:07 ` Roy Sigurd Karlsbakk
2002-10-23 13:41 ` [RESEND] tuning linux for high network performance? Roy Sigurd Karlsbakk
2002-10-23 14:59 ` Nivedita Singhvi
2002-10-23 15:26 ` O_DIRECT sockets? (was [RESEND] tuning linux for high network performance?) Roy Sigurd Karlsbakk
2002-10-23 16:34 ` Nivedita Singhvi
2002-10-24 10:14 ` Roy Sigurd Karlsbakk
2002-10-24 10:46 ` David S. Miller
2002-10-23 18:01 ` [RESEND] tuning linux for high network performance? Denis Vlasenko
2002-10-23 13:36 ` Roy Sigurd Karlsbakk
2002-10-24 16:22 ` Denis Vlasenko
2002-10-24 11:50 ` Russell King
2002-10-24 12:42 ` bert hubert
2002-10-24 17:41 ` Denis Vlasenko
2002-10-25 11:36 ` Csum and csum copyroutines benchmark Denis Vlasenko
2002-10-25 7:48 ` Momchil Velikov
2002-10-25 13:59 ` Denis Vlasenko
2002-10-25 9:47 ` Momchil Velikov
2002-10-25 10:19 ` Alan Cox
2002-10-25 16:00 ` Denis Vlasenko
2002-10-23 14:52 ` [RESEND] tuning linux for high network performance? Nivedita Singhvi