* Add PGM protocol support to the IP stack @ 2010-03-18 17:58 Christoph Lameter 2010-03-18 21:58 ` Christoph Lameter 2010-03-19 17:18 ` Andi Kleen 0 siblings, 2 replies; 21+ messages in thread
From: Christoph Lameter @ 2010-03-18 17:58 UTC (permalink / raw)
To: David Miller, netdev; +Cc: linux-kernel

Is there any work in progress on including PGM support (RFC 3208) in the kernel?

I know about the openpgm implementation. Openpgm does this at the user level and requires linking to a library. It is essentially a communication protocol done in user space. It has privilege issues because it has to create PGM packets via a raw socket, which also has implications for the achievable performance. Openpgm seems to be able to interoperate with major commercial implementations of PGM.

I am looking at openpgm right now and it seems that there are a number of useful files and functions in there that could be used to implement PGM support in the kernel.

There is also an existing socket API for handling PGM available in another operating system whose name we would rather avoid mentioning. That socket API could be used as the basis. PGM use would then be possible without a library and without privilege and performance issues.

PGM support would provide two different modes of communication:

1. Native PGM (allows NAK suppression by Cisco routers to be used)

   socket(AF_INET, SOCK_RDM, IPPROTO_RM)

   (SOCK_RDM is defined in the kernel sources but not implemented. PGM support would implement SOCK_RDM; IPPROTO_RM would need to be defined according to the IANA protocol number for PGM.)

2. PGM over UDP (which is used by many commercial products but not by the unspeakable OS). No router support for NAK suppression is available. For this I guess we would have to support

   socket(AF_INET, SOCK_RDM, IPPROTO_UDP)

I would be interested to find others who are interested in such a project, or maybe there is already a project in the works? If not then I will try to come up with some code to get this going.
Any help you could offer would be appreciated.
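
For illustration, the two socket types proposed above could be probed from user space roughly as follows. This is a minimal sketch and not part of the proposal: pgm_try_open() is a hypothetical helper, and IPPROTO_PGM is defined locally as 113 (the value the follow-up patch adds to in.h) in case the system headers do not provide it.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

#ifndef IPPROTO_PGM
#define IPPROTO_PGM 113	/* assumption: protocol number used by the patch */
#endif

/*
 * Hypothetical helper: try to open a SOCK_RDM socket for the given
 * protocol. Returns the file descriptor, or -1 if the kernel refuses.
 */
static int pgm_try_open(int protocol)
{
	int fd = socket(AF_INET, SOCK_RDM, protocol);

	if (fd < 0)
		fprintf(stderr, "SOCK_RDM protocol %d: %s\n",
			protocol, strerror(errno));
	return fd;
}

/* Probe both proposed modes and report how many could be opened. */
static int pgm_probe_modes(void)
{
	int protocols[2] = { IPPROTO_PGM, IPPROTO_UDP };
	int available = 0;

	for (int i = 0; i < 2; i++) {
		int fd = pgm_try_open(protocols[i]);

		if (fd >= 0) {
			available++;
			close(fd);
		}
	}
	return available;
}
```

Calling pgm_probe_modes() from main() on a kernel without PGM support is expected to report both modes as unavailable, since nothing registers SOCK_RDM for AF_INET there.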
* Re: Add PGM protocol support to the IP stack 2010-03-18 17:58 Add PGM protocol support to the IP stack Christoph Lameter @ 2010-03-18 21:58 ` Christoph Lameter 2010-03-19 17:18 ` Andi Kleen 1 sibling, 0 replies; 21+ messages in thread From: Christoph Lameter @ 2010-03-18 21:58 UTC (permalink / raw) To: David Miller, netdev; +Cc: linux-kernel Here is what I have so far after a couple of hours. Something hacked together from openpgm and udplite. --- Documentation/networking/pgm/TODO | 8 Documentation/networking/pgm/references | 2 Documentation/networking/pgm/usage | 91 ++++ include/linux/in.h | 2 include/linux/pgm.h | 720 ++++++++++++++++++++++++++++++++ net/ipv4/Kconfig | 14 net/ipv4/Makefile | 3 net/ipv4/pgm.c | 143 ++++++ 8 files changed, 983 insertions(+) Index: linux-2.6/include/linux/in.h =================================================================== --- linux-2.6.orig/include/linux/in.h 2010-03-18 11:05:24.000000000 -0500 +++ linux-2.6/include/linux/in.h 2010-03-18 15:47:59.000000000 -0500 @@ -44,6 +44,7 @@ enum { IPPROTO_PIM = 103, /* Protocol Independent Multicast */ IPPROTO_COMP = 108, /* Compression Header protocol */ + IPPROTO_PGM = 113, /* Pragmatic General Multicast */ IPPROTO_SCTP = 132, /* Stream Control Transport Protocol */ IPPROTO_UDPLITE = 136, /* UDP-Lite (RFC 3828) */ @@ -51,6 +52,7 @@ enum { IPPROTO_MAX }; +#define IPPROTO_RM IPPROTO_PGM /* Internet address. */ struct in_addr { Index: linux-2.6/include/linux/pgm.h =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-2.6/include/linux/pgm.h 2010-03-18 16:56:19.000000000 -0500 @@ -0,0 +1,720 @@ +/* + * PGM packet formats, RFC 3208. + * + * Copyright (c) 2006 Miru Limited. + * Copyright (c) 2010 Christoph Lameter, The Linux Foundation. 
+ * + * This library is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option) any later version. + * + * This library is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with this library; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * March 17, 2010 Christoph Lameter + * Basic PGM definitions extracted from openpgm project. + * March 18, 2010 + * Socket API and document intended usage. + * Basic protocol environment (from udplite.c) + */ + +#ifndef _LINUX_PGM_H +#define _LINUX_PGM_H + +#include <linux/types.h> + +/* PGM socket options */ + +/* Transmitter */ +#define RM_LATEJOIN 1 /* X Not supported on receive so why have it? */ +#define RM_RATE_WINDOW_SIZE 2 /* See struct pgm_send_window */ +#define RM_SEND_WINDOW_ADV_RATE 3 /* X Increase of send window in percentage of window */ +#define RM_SENDER_STATISTICS 4 /* see struct pgm_sender_stats */ +#define RM_SENDER_WINDOW_ADVANCE_METHOD 5 /* X seems obsolete */ +#define RM_SET_MCAST_TTL 6 /* X Can be set via IP_MULTICAST_TTL */ +#define RM_SET_MESSAGE_BOUNDARY 7 /* Fix the size of the messages in bytes */ +#define RM_SET_SEND_IF 8 /* X use IP_MULTICAST_IF etc instead */ +#define RM_USE_FEC 9 + +/* Receiver */ +#define RM_ADD_RECEIVE_IF 100 /* X ???? IP_MULTICAST_IF instead? 
*/ +#define RM_DEL_RECEIVE_IF 101 /* X IP_MULTICAST_IF */ +#define RM_HIGH_SPEED_INTRANET_OPT 102 /* X PGM should adapt automatically to high speed networks */ +#define RM_RECEIVER_STATISTICS 103 /* See struct pgm_receiver_stats */ + +/* Socket API structures (established by M$DN) */ +struct pgm_receiver_stats { + u64 NumODataPacketsReceived; /* Number of ODATA (original) sequences */ + u64 NumRDataPacketsReceived; /* Number of RDATA (repair) sequences */ + u64 NumDuplicateDataPackets; /* Duplicate sequences */ + u64 DataBytesReceived; + u64 TotalBytesReceived; + u64 RateKBitsPerSecOverall; /* Receive rate since start of session X */ + u64 RateKBitsPerSecLast; /* Receive rate for last second X*/ + u64 TrailingEdgeSeqId; /* Oldest sequence in the receive window */ + u64 LeadingEdgeSeqId; /* Newest sequence in the receive window */ + u64 AverageSequencesInWindow; /* Average number of sequences in receive window X */ + u64 MinSequencesInWindow; /* The mininum number of sequences */ + u64 MaxSequencesInWindow; /* The maximum number of sequences */ + u64 FirstNakSequenceNumber; /* First outstanding nack sequence number */ + u64 NumPendingNaks; /* Number of sequences waiting for NCF */ + u64 NumOutstandingNaks; /* Number of sequences waiting for RDATA */ + u64 NumDataPacketsBuffered; /* Number of packets currently buffered */ + u64 TotalSelectiveNaksSent; /* Number of NAKs sent total */ + u64 TotalParityNaksSent; /* Number of parity NAKs sent */ +}; + +struct pgm_sender_stats { + u64 DataBytesSent; + u64 TotalBytesSent; + u64 NaksReceived; + u64 NaksReceivedTooLate; /* NAKs received after receive window advanced */ + u64 NumOutstandingNaks; /* Number of NAKs awaiting response */ + u64 NumNaksAfterRData; /* Number of NAKs after RDATA sequences were sent which were ignored */ + u64 RepairPacketsSent; + u64 BufferSpaceAvailable; /* Number of partial messages dropped */ + u64 TrailingEdgeSeqId; /* Oldest sequence id in window */ + u64 LeadingEdgeSeqId; /* Newest sequence id 
in window */ + u64 RateKBitsPerSecOverall; /* Rate since start of session X */ + u64 RateKBitsPerSecLast; /* Rate in last second X */ + u64 TotalODataPacketsSent; /* Total data packets transmitted */ +}; + +/* Setup of sender RateKbitsPerSec = WindowSizeBytes / WindowSizeMSecs */ +struct pgm_send_window { + u64 RateKbitsPerSec; /* Allowed rate for the sender in kbits per second */ + u64 WindowSizeInMSecs; /* Send window size in time */ + u64 WindowSizeInBytes; /* Window size in bytes */ +}; + +struct pgm_fec_info { + u16 FECBlockSize; /* Maximum number of packets for a group. Default and max = 255 */ + u16 FECProActivePackets; /* Number of proactive packets per group. */ + u8 FECGroupSize; /* Number of packets to be treated as a group. Power of two */ + int fFECOnDemandParityEnabled; /* Allow sender to sent parity repair packets */ +}; + +/* address family indicator, rfc 1700 (ADDRESS FAMILY NUMBERS) */ +#ifndef AFI_IP +#define AFI_IP 1 /* IP (IP version 4) */ +#define AFI_IP6 2 /* IP6 (IP version 6) */ +#endif + +/* UDP ports for UDP encapsulation, as per IBM WebSphere MQ */ +#define PGM_DEFAULT_UDP_ENCAP_UCAST_PORT 3055 +#define PGM_DEFAULT_UDP_ENCAP_MCAST_PORT 3056 + +/* PGM default ports */ +#define PGM_DEFAULT_DATA_DESTINATION_PORT 7500 +#define PGM_DEFAULT_DATA_SOURCE_PORT 0 /* random */ + +/* DoS limitation to protocol (MS08-036, KB950762) */ +#define PGM_MAX_APDU UINT16_MAX + +/* Cisco default: 24 (max 8200), Juniper & H3C default: 16 */ +#define PGM_MAX_FRAGMENTS 16 + +enum pgm_type { + PGM_SPM = 0x00, /* 8.1: source path message */ + PGM_POLL = 0x01, /* 14.7.1: poll request */ + PGM_POLR = 0x02, /* 14.7.2: poll response */ + PGM_ODATA = 0x04, /* 8.2: original data */ + PGM_RDATA = 0x05, /* 8.2: repair data */ + PGM_NAK = 0x08, /* 8.3: NAK or negative acknowledgement */ + PGM_NNAK = 0x09, /* 8.3: N-NAK or null negative acknowledgement */ + PGM_NCF = 0x0a, /* 8.3: NCF or NAK confirmation */ + PGM_SPMR = 0x0c, /* 13.6: SPM request */ + PGM_MAX = 0xff +}; + 
+#define PGM_OPT_LENGTH 0x00 /* options length */ +#define PGM_OPT_FRAGMENT 0x01 /* fragmentation */ +#define PGM_OPT_NAK_LIST 0x02 /* list of nak entries */ +#define PGM_OPT_JOIN 0x03 /* late joining */ +#define PGM_OPT_REDIRECT 0x07 /* redirect */ +#define PGM_OPT_SYN 0x0d /* synchronisation */ +#define PGM_OPT_FIN 0x0e /* session end */ +#define PGM_OPT_RST 0x0f /* session reset */ + +#define PGM_OPT_PARITY_PRM 0x08 /* forward error correction parameters */ +#define PGM_OPT_PARITY_GRP 0x09 /* group number */ +#define PGM_OPT_CURR_TGSIZE 0x0a /* group size */ + +#define PGM_OPT_CR 0x10 /* congestion report */ +#define PGM_OPT_CRQST 0x11 /* congestion report request */ + +#define PGM_OPT_NAK_BO_IVL 0x04 /* nak back-off interval */ +#define PGM_OPT_NAK_BO_RNG 0x05 /* nak back-off range */ +#define PGM_OPT_NBR_UNREACH 0x0b /* neighbour unreachable */ +#define PGM_OPT_PATH_NLA 0x0c /* path nla */ + +#define PGM_OPT_INVALID 0x7f /* option invalidated */ + +/* 8. PGM header */ +struct pgm_header { + u16 sport; /* source port: tsi::sport or UDP port depending on direction */ + u16 dport; /* destination port */ + u8 type; /* version / packet type */ + u8 options; /* options */ +#define PGM_OPT_PARITY 0x80 /* parity packet */ +#define PGM_OPT_VAR_PKTLEN 0x40 /* + variable sized packets */ +#define PGM_OPT_NETWORK 0x02 /* network-significant: must be interpreted by network elements */ +#define PGM_OPT_PRESENT 0x01 /* option extension are present */ + u16 checksum; /* checksum */ + u8 gsi[6]; /* global source id */ + u16 tsdu_length; /* tsdu length */ + /* tpdu length = th length (header + options) + tsdu length */ +}; + +/* 8.1. Source Path Messages (SPM) */ +struct pgm_spm { + u32 sqn; /* spm sequence number */ + u32 trail; /* trailing edge sequence number */ + u32 lead; /* leading edge sequence number */ + u16 nla_afi; /* nla afi */ + u16 reserved; /* reserved */ + struct in_addr spm_nla; /* path nla */ + /* ... 
option extensions */ +}; + +struct pgm_spm6 { + u32 sqn; /* spm sequence number */ + u32 trail; /* trailing edge sequence number */ + u32 lead; /* leading edge sequence number */ + u16 nla_afi; /* nla afi */ + u16 reserved; /* reserved */ + struct in6_addr spm6_nla; /* path nla */ + /* ... option extensions */ +}; + +/* 8.2. Data Packet */ +struct pgm_data { + u32 sqn; /* data packet sequence number */ + u32 trail; /* trailing edge sequence number */ + /* ... option extensions */ + /* ... data */ +}; + +/* 8.3. Negative Acknowledgments and Confirmations (NAK, N-NAK, & NCF) */ +struct pgm_nak { + u32 sqn; /* requested sequence number */ + u16 src_nla_afi; /* nla afi */ + u16 reserved; /* reserved */ + struct in_addr src_nla; /* source nla */ + u16 grp_nla_afi; /* nla afi */ + u16 reserved2; /* reserved */ + struct in_addr grp_nla; /* multicast group nla */ + /* ... option extension */ +}; + +struct pgm_nak6 { + u32 sqn; /* requested sequence number */ + u16 src_nla_afi; /* nla afi */ + u16 reserved; /* reserved */ + struct in6_addr src_nla; /* source nla */ + u16 grp_nla_afi; /* nla afi */ + u16 reserved2; /* reserved */ + struct in6_addr grp_nla; /* multicast group nla */ + /* ... option extension */ +}; + +/* 9. Option header (max 16 per packet) */ +struct pgm_opt_header { + u8 type; /* option type */ +#define PGM_OPT_MASK 0x7f +#define PGM_OPT_END 0x80 /* end of options flag */ + u8 length; /* option length */ + u8 reserved; +#define PGM_OP_ENCODED 0x8 /* F-bit */ +#define PGM_OPX_MASK 0x3 +#define PGM_OPX_IGNORE 0x0 /* extensibility bits */ +#define PGM_OPX_INVALIDATE 0x1 +#define PGM_OPX_DISCARD 0x2 +#define PGM_OP_ENCODED_NULL 0x80 /* U-bit */ +}; + +/* 9.1. Option extension length - OPT_LENGTH */ +struct pgm_opt_length { + u8 type; /* include header as total length overwrites reserved/OPX bits */ + u8 length; + u16 total_length; /* total length of all options */ +}; + +/* 9.2. 
Option fragment - OPT_FRAGMENT */ +struct pgm_opt_fragment { + u8 reserved; /* reserved */ + u32 sqn; /* first sequence number */ + u32 frag_off; /* offset */ + u32 frag_len; /* length */ +}; + +/* 9.3.5. Option NAK List - OPT_NAK_LIST */ +struct pgm_opt_nak_list { + u8 reserved; /* reserved */ + u32 sqn[]; +}; + +/* 9.4.2. Option Join - OPT_JOIN */ +struct pgm_opt_join { + u8 reserved; /* reserved */ + u32 join_min; /* minimum sequence number */ +}; + +/* 9.5.5. Option Redirect - OPT_REDIRECT */ +struct pgm_opt_redirect { + u8 reserved; /* reserved */ + u16 nla_afi; /* nla afi */ + u16 reserved2; /* reserved */ + struct in_addr nla; /* dlr nla */ +}; + +struct pgm_opt6_redirect { + u8 reserved; /* reserved */ + u16 nla_afi; /* nla afi */ + u16 reserved2; /* reserved */ + struct in6_addr opt6_nla; /* dlr nla */ +}; + +/* 9.6.2. Option Sources - OPT_SYN */ +struct pgm_opt_syn { + u8 reserved; /* reserved */ +}; + +/* 9.7.4. Option End Session - OPT_FIN */ +struct pgm_opt_fin { + u8 reserved; /* reserved */ +}; + +/* 9.8.4. Option Reset - OPT_RST */ +struct pgm_opt_rst { + u8 reserved; /* reserved */ +}; + + +/* + * Forward Error Correction - FEC + */ + +/* 11.8.1. Option Parity - OPT_PARITY_PRM */ +struct pgm_opt_parity_prm { + u8 reserved; /* reserved */ +#define PGM_PARITY_PRM_MASK 0x3 +#define PGM_PARITY_PRM_PRO 0x1 /* source provides pro-active parity packets */ +#define PGM_PARITY_PRM_OND 0x2 /* on-demand parity packets */ + u32 tgs; /* transmission group size */ +}; + +/* 11.8.2. Option Parity Group - OPT_PARITY_GRP */ +struct pgm_opt_parity_grp { + u8 reserved; /* reserved */ + u32 group; /* parity group number */ +}; + +/* 11.8.3. Option Current Transmission Group Size - OPT_CURR_TGSIZE */ +struct pgm_opt_curr_tgsize { + u8 reserved; /* reserved */ + u32 atgsize; /* actual transmission group size */ +}; + +/* + * Congestion Control + */ + +/* 12.7.1. 
Option Congestion Report - OPT_CR */ +struct pgm_opt_cr { + u8 reserved; /* reserved */ + u32 cr_lead; /* congestion report reference sqn */ + u16 cr_ne_wl; /* ne worst link */ + u16 cr_ne_wp; /* ne worst path */ + u16 cr_rx_wp; /* rcvr worst path */ + u16 reserved2; /* reserved */ + u16 nla_afi; /* nla afi */ + u16 reserved3; /* reserved */ + u32 cr_rcvr; /* worst receivers nla */ +}; + +/* 12.7.2. Option Congestion Report Request - OPT_CRQST */ +struct pgm_opt_crqst { + u8 reserved; /* reserved */ +}; + + +/* + * SPM Requests + */ + +/* 13.6. SPM Requests */ +struct pgm_spmr { + /* ... option extensions */ +}; + + +/* + * Poll Mechanism + */ + +/* 14.7.1. Poll Request */ +struct pgm_poll { + u32 sqn; /* poll sequence number */ + u16 round; /* poll round */ + u16 type; /* poll sub-type */ +#define PGM_POLL_GENERAL 0x0 /* general poll */ +#define PGM_POLL_DLR 0x1 /* DLR poll */ + u16 nla_afi; /* nla afi */ + u16 reserved; /* reserved */ + struct in_addr nla; /* path nla */ + u32 bo_ivl; /* poll back-off interval */ + char rand[4]; /* random string */ + u32 mask; /* matching bit-mask */ + /* ... option extensions */ +}; + +struct pgm_poll6 { + u32 sqn; /* poll sequence number */ + u16 round; /* poll round */ + u16 s_type; /* poll sub-type */ + u16 nla_afi; /* nla afi */ + u16 reserved; /* reserved */ + struct in6_addr nla; /* path nla */ + u32 bo_ivl; /* poll back-off interval */ + char rand[4]; /* random string */ + u32 mask; /* matching bit-mask */ + /* ... option extensions */ +}; + +/* 14.7.2. Poll Response */ +struct pgm_polr { + u32 sqn; /* polr sequence number */ + u16 round; /* polr round */ + u16 reserved; /* reserved */ + /* ... option extensions */ +}; + + +/* + * Implosion Prevention + */ + +/* 15.4.1. Option NAK Back-Off Interval - OPT_NAK_BO_IVL */ +struct pgm_opt_nak_bo_ivl { + u8 opt_reserved; /* reserved */ + u32 opt_nak_bo_ivl; /* nak back-off interval */ + u32 opt_nak_bo_ivl_sqn; /* nak back-off interval sqn */ +}; + +/* 15.4.2. 
Option NAK Back-Off Range - OPT_NAK_BO_RNG */ +struct pgm_opt_nak_bo_rng { + u8 opt_reserved; /* reserved */ + u32 opt_nak_max_bo_ivl; /* maximum nak back-off interval */ + u32 opt_nak_min_bo_ivl; /* minimum nak back-off interval */ +}; + +/* 15.4.3. Option Neighbour Unreachable - OPT_NBR_UNREACH */ +struct pgm_opt_nbr_unreach { + u8 opt_reserved; /* reserved */ +}; + +/* 15.4.4. Option Path - OPT_PATH_NLA */ +struct pgm_opt_path_nla { + u8 reserved; /* reserved */ + struct in_addr opt_path_nla; /* path nla */ +}; + +struct pgm_opt6_path_nla { + u8 reserved; /* reserved */ + struct in6_addr opt6_path_nla; /* path nla */ +}; + +#ifdef __KERNEL__ + +#include <net/inet_sock.h> +#include <linux/skbuff.h> +#include <net/netns/hash.h> +#include <linux/rslib.h> + +static inline int pgm_is_upstream(u8 type) +{ + return (type == PGM_NAK || /* unicast */ + type == PGM_NNAK || /* unicast */ + type == PGM_SPMR || /* multicast + unicast */ + type == PGM_POLR); /* unicast */ +} + +static inline int pgm_is_peer(u8 type) +{ + return (type == PGM_SPMR); /* multicast */ +} + +static inline int pgm_is_downstream (u8 type) +{ + return (type == PGM_SPM || /* all multicast */ + type == PGM_ODATA || + type == PGM_RDATA || + type == PGM_POLL || + type == PGM_NCF); +} + +int pgm_verify_spm(struct sk_buff *); +int pgm_verify_spmr(struct sk_buff *); +int pgm_verify_nak(struct sk_buff *); +int pgm_verify_nnak(struct sk_buff *); +int pgm_verify_ncf(struct sk_buff *); +int pgm_verify_poll(struct sk_buff *); +int pgm_verify_polr(struct sk_buff *); + +/* Global sesssion ID */ +struct pgm_gsi { + char gsi[6]; +}; + +struct pgm_tsi { + char gsi[6]; /* global session identifier */ + u16 sport; /* source port: a random number to help detect session re-starts */ +} + +/* Receiver data structures */ + +enum pgm_rxw_state { + PGM_PKT_ERROR_STATE, + PGM_PKT_BACK_OFF_STATE, /* PGM protocol recovery states */ + PGM_PKT_WAIT_NCF_STATE, + PGM_PKT_WAIT_DATA_STATE, + + PGM_PKT_HAVE_DATA_STATE, /* data received 
waiting to commit to application layer */ + + PGM_PKT_HAVE_PARITY_STATE, /* contains parity information not original data */ + PGM_PKT_COMMIT_DATA_STATE, /* commited data waiting for purging */ + PGM_PKT_LOST_DATA_STATE, /* if recovery fails, but packet has not yet been commited */ +}; + +enum pgm_rxw_returns { + PGM_RXW_OK, + PGM_RXW_INSERTED, + PGM_RXW_APPENDED, + PGM_RXW_UPDATED, + PGM_RXW_MISSING, + PGM_RXW_DUPLICATE, + PGM_RXW_MALFORMED, + PGM_RXW_BOUNDS, + PGM_RXW_SLOW_CONSUMER, + PGM_RXW_UNKNOWN, +}; + +struct pgm_rxw_state { + unsigned long nak_rb_expiry; + unsigned long nak_rpt_expiry; + unsigned long nak_rdata_expiry; + + enum pgm_receiver_state state; + + u8 nak_transmit_count; + u8 ncf_retry_count; + u8 data_retry_count; + +/* only valid on tg_sqn::pkt_sqn = 0 */ + unsigned is_contiguous:1; /* transmission group */ +}; + +struct pgm_rxw { + struct pgm_tsi * tsi; + + struct list_head backoff_queue; + struct list_head wait_ncf_queue; + struct list_head wait_data_queue; + + /* window context counters */ + u32 lost_count; /* failed to repair */ + u32 fragment_count; /* incomplete apdu */ + u32 parity_count; /* parity for repairs */ + u32 committed_count; /* but still in window */ + + u16 max_tpdu; /* maximum packet size */ + u32 lead, trail; + u32 rxw_trail, rxw_trail_init; + u32 commit_lead; + unsigned is_constrained:1; + unsigned is_defined:1; + unsigned has_event:1; /* edge triggered */ + unsigned is_fec_available:1; + struct rs_t rs; + u32 tg_size; /* transmission group size for parity recovery */ + unsigned tg_sqn_shift; + + u32 min_fill_time; /* restricted from pgm_time_t */ + u32 max_fill_time; + u32 min_nak_transmit_count; + u32 max_nak_transmit_count; + u32 cumulative_losses; + u32 bytes_delivered; /* Fix this: Will overflow */ + u32 msgs_delivered; + + size_t size; /* in bytes */ + unsigned alloc; /* in pkts */ + struct sk_buff *pdata[]; +}; + +struct pgm_rxw* pgm_rxw_create(pgm_tsi *, u16, u32, unsigned, unsigned); +void pgm_rxw_destroy(struct 
pgm_rxw *); +int pgm_rxw_add(struct pgm_rxw *, struct sk_buf *, u64, u64); +void pgm_rxw_remove_commit(struct pgm_rxw *); +size_t pgm_rxw_readv(struct pgm_rxw *, struct kiovec *, unsigned int); +unsigned int pgm_rxw_remove_trail (struct pgm_rxw *); +unsigned int pgm_rxw_update(struct pgm_rxw *, u32, u32, u64, u64); +void pgm_rxw_update_fec(struct pgm_rxw *, unsigned int); +int pgm_rxw_confirm(struct pgm_rxw *, u32, u64, u64, u64); +void pgm_rxw_lost(struct pgm_rxw *, u32); +void pgm_rxw_state(struct pgm_rxw *, struct sk_buff *, enum pgm_pkt_state); +struct sk_buff *pgm_rxw_peek(struct pgm_rxw *, u32); + +static inline int pgm_rxw_max_length(struct pgm_rxw *window) +{ + return window->alloc; +} + +static inline u32 pgm_rxw_length(struct pgm_rxw *window) +{ + return ( 1 + window->lead ) - window->trail; +} + +static inline size_t pgm_rxw_size(struct pgm_rxw *window) +{ + return window->size; +} + +static inline int pgm_rxw_is_empty(struct pgm_rxw *window) +{ + return pgm_rxw_length (window) == 0; +} + +static inline int pgm_rxw_is_full(struct pgm_rxw *window) +{ + return pgm_rxw_length (window) == pgm_rxw_max_length (window); +} + +static inline u32 pgm_rxw_lead(struct pgm_rxw *window) +{ + return window->lead; +} + +static inline u32 pgm_rxw_next_lead(struct pgm_rxw *window) +{ + return pgm_rxw_lead(window) + 1; +} + +/* Transmitter data structures */ + +struct pgm_txw_state { + u32 unfolded_checksum; /* first 32-bit word must be checksum */ + + unsigned waiting_retransmit:1; /* in retransmit queue */ + unsigned retransmit_count:15; + unsigned nak_elimination_count:16; + + unsigned long expiry; /* Advance with time */ + unsigned long last_retransmit; /* NAK elimination */ +}; + +struct pgm_txw { + struct pgm_tsi* tsi; + +/* option: lockless atomics */ + u32 lead; + u32 trail; + + struct list_head retransmit_queue; + + struct rs_t rs; + unsigned int tg_sqn_shift; + struct sk_buff * parity_buffer; + unsigned is_fec_enabled:1; + + u32 size; /* window content size in 
bytes */ + u32 alloc; /* length of pdata[] */ + struct sk_buff* pdata[]; +}; + +struct pgm_txw *pgm_txw_create(pgm_tsi *, u16, u32, unsigned int, + unsigned int, int, unsigned int, unsigned int); +void pgm_txw_shutdown (struct pgm_txw *); +void pgm_txw_add(struct pgm_txw *, struct sk_buff *); +struct sk_buff* pgm_txw_peek(struct pgm_txw* , u32); +int pgm_txw_retransmit_push(struct pgm_txw *, u32, int, unsigned int); +struct sk_buff* pgm_txw_retransmit_try_peek(struct pgm_txw *); +void pgm_txw_retransmit_remove_head(struct pgm_txw *); + +static inline unsigned int pgm_txw_max_length(struct pgm_txw *window) +{ + return window->alloc; +} + +static inline u32 pgm_txw_length(struct pgm_txw *window) +{ + return ( 1 + window->lead ) - window->trail; +} + +static inline u32 pgm_txw_size(struct pgm_txw *window) +{ + return window->size; +} + +static inline int pgm_txw_is_empty(struct pgm_txw *window) +{ + return pgm_txw_length(window) == 0; +} + +static inline int pgm_txw_is_full(struct pgm_txw *window) +{ + return pgm_txw_length(window) == pgm_txw_max_length(window); +} + +static inline u32 pgm_txw_lead(struct pgm_txw *window) +{ + return window->lead; +} + +static inline u32 pgm_txw_next_lead(struct pgm_txw *window) +{ + return pgm_txw_lead (window) + 1; +} + +static inline u32 pgm_txw_trail(struct pgm_txw *window) +{ + return window->trail; +} + +static inline u32 pgm_txw_get_unfolded_checksum(struct sk_buff *skb) +{ + struct pgm_txw_state *state = (void *)&skb->cb; + + return state->unfolded_checksum; +} + +static inline void pgm_txw_set_unfolded_checksum(struct sk_buff* skb, u32 csum) +{ + struct pgm_txw_state *state = (void *)&skb->cb; + + state->unfolded_checksum = csum; +} + +static inline void pgm_txw_inc_retransmit_count(struct sk_buff * skb) +{ + struct pgm_txw_state *state = (void *)&skb->cb; + + state->retransmit_count++; +} + +static inline int pgm_txw_retransmit_is_empty(struct pgm_txw *window) +{ + return list_empty(&window->retransmit_queue); +} + +#endif 
/* __KERNEL__ */ + +#endif /* _LINUX_PGM_H */ Index: linux-2.6/Documentation/networking/pgm/TODO =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-2.6/Documentation/networking/pgm/TODO 2010-03-18 13:14:59.000000000 -0500 @@ -0,0 +1,8 @@ +- Define Socket API +- Define /proc and sys api +- Implement base logic +- PGM over UDP +- FEC Forward Error correction +- Verify interaction with Cisco and other switches +- Verify interaction with IBM Websphere, TIBCO, openpgm etc. + Index: linux-2.6/Documentation/networking/pgm/references =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-2.6/Documentation/networking/pgm/references 2010-03-18 13:14:59.000000000 -0500 @@ -0,0 +1,2 @@ +RFC3208 + Index: linux-2.6/Documentation/networking/pgm/usage =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-2.6/Documentation/networking/pgm/usage 2010-03-18 15:55:17.000000000 -0500 @@ -0,0 +1,91 @@ +1. Opening a socket + + A. Native PGM + + fd = socket(AF_INET, SOCK_RDM, IPPROTO_PGM) + + B. PGM over UDP + + fd = socket(AF_INET, SOCK_RDM, IPPROTO_UDP) + + C. PGM over SHM (?) + + fd = socket(AF_UNIX, SOCK_RDM, 0) + + +2. Binding to a multicast address + + A. Sender + + Connect the socket to a MC address and port using connect(). + + Note that the port is significant since multiple streams on different + ports can be run over the same MC addr. + + B. Receiver + + I. Bind the socket to the MC address and port of interest. + + II. Listen to the socket. + + Process will wait until a PGM packet destined to the port of interest + is received. + + III. Accept a connection. + + Establishes a session. Data can then be received. + + +3. 
Sending and receiving + + Use the usual socket read and write operations and the various flavors of waiting + for a packet via select, poll, epoll etc. + + Message sizes are determined by the number of bytes in a single sendmsg() unless + overridden by the RM_SET_MESSAGE_BOUNDARY socket option. + + The sender will block when the send window is full unless a non-blocking write is performed. + + The receiver shows the usual wait semantics. If the stream is set to unreliable then + packets may arrive in random order. If the socket is set to RM_RECEIVE_ONLY then packets may + just be missing. + +4. Transmitter Socket Options + + + A. Setting the window size / rate. + + struct pgm_send_window x; + x.RateKbitsPerSec = 56; + x.WindowSizeInMSecs = 60000; + x.WindowSizeInBytes = 10000000; + + setsockopt(fd, IPPROTO_RM, RM_RATE_WINDOW_SIZE, &x, sizeof(x)); + + Default is sending at 56Kbps with a buffer of 10 Megabytes and buffering for a minute. + + B. FEC mode + + struct pgm_fec_info x; + + x.FECBlockSize = 255; + x.FECProActivePackets = 0; + x.FECGroupSize = 0; + x.fFECOnDemandParityEnabled = 1; + + setsockopt(fd, IPPROTO_RM, RM_USE_FEC, &x, sizeof(x)); + + +5. Receiver Socket Options + + None? + + +Possible Extensions + + RM_UNORDERED Accept unordered packets, avoiding delays when packets arrive out of sequence. + A missing packet is still NAKed. + + RM_RECEIVE_ONLY Simply ignore missed packets. Do not send any replies. + + Index: linux-2.6/net/ipv4/pgm.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-2.6/net/ipv4/pgm.c 2010-03-18 16:37:17.000000000 -0500 @@ -0,0 +1,143 @@ +/* + * PGM An implementation of the PGM (Pragmatic General Multicast) + * protocol (RFC 3208). 
+ * + * Authors: Christoph Lameter <cl@linux-foundation.org> + * + * Changes: + * Fixes: + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version + * 2 of the License, or (at your option) any later version. + */ +#include "udp_impl.h" + +struct udp_table pgm_table __read_mostly; +EXPORT_SYMBOL(pgm_table); + +static int pgm_rcv(struct sk_buff *skb) +{ + /* TBD */ + return __udp4_lib_rcv(skb, &pgm_table, IPPROTO_UDPLITE); +} + +static void pgm_err(struct sk_buff *skb, u32 info) +{ + __udp4_lib_err(skb, info, &pgm_table); +} + +static const struct net_protocol pgm_protocol = { + .handler = pgm_rcv, + .err_handler = pgm_err, + .no_policy = 1, + .netns_ok = 1, +}; + +struct proto pgm_prot = { + .name = "PGM", + .owner = THIS_MODULE, + .close = udp_lib_close, + .connect = ip4_datagram_connect, + .disconnect = udp_disconnect, + .ioctl = udp_ioctl, + .init = pgm_sk_init, + .destroy = udp_destroy_sock, + .setsockopt = pgm_setsockopt, + .getsockopt = pgm_getsockopt, + .sendmsg = pgm_sendmsg, + .recvmsg = pgm_recvmsg, + .sendpage = pgm_sendpage, + .backlog_rcv = udp_queue_rcv_skb, + .hash = udp_lib_hash, + .unhash = udp_lib_unhash, + .get_port = udp_v4_get_port, + .obj_size = sizeof(struct udp_sock), + .slab_flags = SLAB_DESTROY_BY_RCU, + .h.udp_table = &pgm_table, +#ifdef CONFIG_COMPAT + .compat_setsockopt = compat_pgm_setsockopt, + .compat_getsockopt = compat_pgm_getsockopt, +#endif +}; + +static struct inet_protosw pgm_ip_protosw = { + .type = SOCK_RDM, + .protocol = IPPROTO_PGM, + .prot = &pgm_ip_prot, + .ops = &inet_pgm_ops, + .no_check = 0, /* must checksum (RFC 3828) */ + .flags = INET_PROTOSW_PERMANENT, +}; + +static struct inet_protosw pgm_udp_protosw = { + .type = SOCK_RDM, + .protocol = IPPROTO_UDP, + .prot = &pgm_udp_prot, + .ops = &inet_pgm_ops, + .no_check = 0, /* must checksum (RFC 3828) */ + .flags = INET_PROTOSW_PERMANENT, 
+}; + +#ifdef CONFIG_PROC_FS +static struct udp_seq_afinfo pgm_seq_afinfo = { + .name = "pgm", + .family = AF_INET, + .udp_table = &pgm_table, + .seq_fops = { + .owner = THIS_MODULE, + }, + .seq_ops = { + .show = udp4_seq_show, + }, +}; + +static int __net_init pgm_proc_init_net(struct net *net) +{ + return udp_proc_register(net, &pgm_seq_afinfo); +} + +static void __net_exit pgm_proc_exit_net(struct net *net) +{ + udp_proc_unregister(net, &pgm_seq_afinfo); +} + +static struct pernet_operations pgm4_net_ops = { + .init = pgm_proc_init_net, + .exit = pgm_proc_exit_net, +}; + +static __init int pgm_proc_init(void) +{ + return register_pernet_subsys(&pgm_net_ops); +} +#else +static inline int pgm_proc_init(void) +{ + return 0; +} +#endif + +void __init pgm_register(void) +{ + udp_table_init(&pgm_table, "PGM"); + if (proto_register(&pgm_prot, 1)) + goto out_register_err; + + if (inet_add_protocol(&pgm_protocol, IPPROTO_PGM) < 0) + goto out_unregister_proto; + + inet_register_protosw(&pgm_ip_protosw); + inet_register_protosw(&pgm_udp_protosw); + + if (pgm_proc_init()) + printk(KERN_ERR "%s: Cannot register /proc!\n", __func__); + return; + +out_unregister_proto: + proto_unregister(&pgm_prot); +out_register_err: + printk(KERN_CRIT "%s: Cannot add PGM protocol.\n", __func__); +} + +EXPORT_SYMBOL(pgm_prot); Index: linux-2.6/net/ipv4/Kconfig =================================================================== --- linux-2.6.orig/net/ipv4/Kconfig 2010-03-18 16:16:34.000000000 -0500 +++ linux-2.6/net/ipv4/Kconfig 2010-03-18 16:39:36.000000000 -0500 @@ -14,6 +14,20 @@ config IP_MULTICAST <file:Documentation/networking/multicast.txt>. For most people, it's safe to say N. +config IP_PGM + bool "IP: Pragmatic General Multicast (RFC3208) support" + depends on IP_MULTICAST && EXPERIMENTAL + help + This is an implementation of reliable multicasting following + RFC3208. PGM is used for publisher-subscriber based information + services on private networks. 
+	  The PGM protocol allows for recovery of lost packets through resend
+	  requests (NAKs) and through the recovery of missing packets via FEC.
+	  PGM is supported by router vendors through logic that allows
+	  correlation of NAKs to avoid flooding the network with NAKs (aka
+	  NAK-storm). PGM is widely used in the financial industry and various
+	  commercial applications support this protocol.
+
 config IP_ADVANCED_ROUTER
 	bool "IP: advanced router"
 	---help---
Index: linux-2.6/net/ipv4/Makefile
===================================================================
--- linux-2.6.orig/net/ipv4/Makefile	2010-03-18 16:16:07.000000000 -0500
+++ linux-2.6/net/ipv4/Makefile	2010-03-18 16:24:04.000000000 -0500
@@ -52,3 +52,6 @@ obj-$(CONFIG_NETLABEL) += cipso_ipv4.o
 obj-$(CONFIG_XFRM) += xfrm4_policy.o xfrm4_state.o xfrm4_input.o \
 		      xfrm4_output.o
+
+obj-$(CONFIG_IP_PGM) += pgm.o
+
* Re: Add PGM protocol support to the IP stack
From: Andi Kleen @ 2010-03-19 17:18 UTC
To: Christoph Lameter; +Cc: David Miller, netdev, linux-kernel

Christoph Lameter <cl@linux-foundation.org> writes:
>
> I know about the openpgm implementation. Openpgm does this at the user
> level and requires linking to a library. It is essentially a communication
> protocol done in user space. It has privilege issues because it has to
> create PGM packets via a raw socket.

That seems like a poor reason alone to put something into the kernel.
Perhaps you rather need some way to have unprivileged raw sockets?

The classical way to do this is to start suid root, only open
the socket and then drop privileges.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.
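The suid-root approach mentioned above can be sketched roughly as follows.
This is only an illustration under stated assumptions, not code taken from
openpgm: the function name open_raw_pgm_then_drop_privs() and the
PGM_IP_PROTO macro are made up for this example, and 113 is the IANA
protocol number for PGM. Opening the socket only succeeds when the process
starts with root privileges (or CAP_NET_RAW).

```c
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* IANA-assigned IP protocol number for PGM; defined locally because
 * not every libc header provides IPPROTO_PGM. Hypothetical macro name. */
#define PGM_IP_PROTO 113

/*
 * Open a raw PGM socket while still privileged, then drop back to the
 * real group and user IDs. Returns the socket descriptor, or -1 with
 * errno set if the socket could not be opened or the drop failed.
 */
int open_raw_pgm_then_drop_privs(void)
{
	int fd = socket(AF_INET, SOCK_RAW, PGM_IP_PROTO);
	int saved_errno = errno;

	/* Drop the group first: after setuid() we may no longer be allowed to. */
	if (setgid(getgid()) != 0 || setuid(getuid()) != 0) {
		saved_errno = errno;
		if (fd >= 0)
			close(fd);
		errno = saved_errno;
		return -1;
	}

	/* Sanity check: the effective UID must now equal the real UID. */
	assert(geteuid() == getuid());

	if (fd < 0)
		errno = saved_errno;
	return fd;
}
```

The privileges are dropped whether or not socket() succeeded, so the rest
of the program never runs with elevated rights. Every process using this
socket still sees all PGM traffic arriving at the host.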
* Re: Add PGM protocol support to the IP stack
From: David Miller @ 2010-03-19 21:53 UTC
To: andi; +Cc: cl, netdev, linux-kernel

From: Andi Kleen <andi@firstfloor.org>
Date: Fri, 19 Mar 2010 18:18:36 +0100

> Christoph Lameter <cl@linux-foundation.org> writes:
>>
>> I know about the openpgm implementation. Openpgm does this at the user
>> level and requires linking to a library. It is essentially a communication
>> protocol done in user space. It has privilege issues because it has to
>> create PGM packets via a raw socket.
>
> That seems like a poor reason alone to put something into the kernel.
> Perhaps you rather need some way to have unprivileged raw sockets?
>
> The classical way to do this is to start suid root, only open
> the socket and then drop privileges.

I completely agree.

We should be able to make a way for unprivileged users to use
RAW sockets in some limited capacity, for cases like this.

But I also don't consider what openpgm has to do right now to
be all that much of a restriction. You need privileges to
add the protocol to the kernel, you need privileges to run
the userspace variant, there is no real difference.
* Re: Add PGM protocol support to the IP stack
From: H. Peter Anvin @ 2010-03-19 22:26 UTC
To: David Miller; +Cc: andi, cl, netdev, linux-kernel

On 03/19/2010 02:53 PM, David Miller wrote:
> But I also don't consider what openpgm has to do right now to
> be all that much of a restriction. You need privileges to
> add the protocol to the kernel, you need privileges to run
> the userspace variant, there is no real difference.

The real difference is if multiplexing is needed between multiple
unprivileged users.

	-hpa
* Re: Add PGM protocol support to the IP stack
From: Christoph Lameter @ 2010-03-22 14:24 UTC
To: H. Peter Anvin; +Cc: David Miller, andi, netdev, linux-kernel

On Fri, 19 Mar 2010, H. Peter Anvin wrote:

> On 03/19/2010 02:53 PM, David Miller wrote:
> > But I also don't consider what openpgm has to do right now to
> > be all that much of a restriction. You need privileges to
> > add the protocol to the kernel, you need privileges to run
> > the userspace variant, there is no real difference.
>
> The real difference is if multiplexing is needed between multiple
> unprivileged users.

It is needed. PGM ports exist and work similarly to UDP and TCP ports.

PGM as provided by openpgm and other solutions avoids native PGM and
instead uses PGM over UDP. But the routers do not support PGM over UDP in
the same way as native PGM. So the NAK suppression and other advanced
features available in Juniper and Cisco switches cannot be used.

openpgm can work with the native PGM protocol via a raw socket but then
one cannot run multiple processes communicating via different ports
effectively. The fragmentation of packets, the reassembly etc. in user
space are a pain.
* Re: Add PGM protocol support to the IP stack
From: Christoph Lameter @ 2010-03-22 14:20 UTC
To: Andi Kleen; +Cc: David Miller, netdev, linux-kernel

On Fri, 19 Mar 2010, Andi Kleen wrote:

> Christoph Lameter <cl@linux-foundation.org> writes:
> >
> > I know about the openpgm implementation. Openpgm does this at the user
> > level and requires linking to a library. It is essentially a communication
> > protocol done in user space. It has privilege issues because it has to
> > create PGM packets via a raw socket.
>
> That seems like a poor reason alone to put something into the kernel.
> Perhaps you rather need some way to have unprivileged raw sockets?

Not the only reason. There are also performance implications. NAKing and
other control messages from user space are a pain and the available
implementations add numerous threads just to control the timing of control
messages and the expiration of data etc. It's difficult to listen to a PGM
port from user space. You have to get all messages for the PGM protocol
and then filter in each process.

PGM operates on the same level as TCP and UDP.

> The classical way to do this is to start suid root, only open
> the socket and then drop privileges.

Yes, those solutions exist, and the experience with their limitations is
the reason to try to get PGM into the kernel.
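To illustrate the per-process filtering described above, here is a minimal
sketch of what each user-space receiver has to do for every packet read
from a raw PGM socket. It assumes the fixed 16-byte PGM header layout of
RFC 3208 (source port, destination port, type, options, checksum, global
source ID, TSDU length) and that the buffer starts with the IPv4 header, as
it does for packets read from an AF_INET raw socket. The function name
pgm_packet_matches_port() is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PGM_HDR_LEN 16	/* fixed PGM header length (RFC 3208) */

/*
 * Return 1 if the IPv4 packet in buf carries a PGM header whose
 * destination port equals want_port, 0 otherwise (including packets
 * that are too short or not IPv4).
 */
int pgm_packet_matches_port(const uint8_t *buf, size_t len, uint16_t want_port)
{
	size_t ihl;
	const uint8_t *pgm;
	uint16_t dport;

	assert(buf != NULL);

	if (len < 20 || (buf[0] >> 4) != 4)
		return 0;

	/* IP header length is given in 32-bit words in the low nibble. */
	ihl = (size_t)(buf[0] & 0x0f) * 4;
	if (ihl < 20 || len < ihl + PGM_HDR_LEN)
		return 0;

	/* Destination port is bytes 2..3 of the PGM header, big endian. */
	pgm = buf + ihl;
	dport = (uint16_t)((pgm[2] << 8) | pgm[3]);

	return dport == want_port;
}
```

Every process with a raw PGM socket receives a copy of every PGM packet and
has to run a check like this itself, which is the overhead that in-kernel
port demultiplexing would avoid.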
* Re: Add PGM protocol support to the IP stack
From: Andi Kleen @ 2010-03-22 16:36 UTC
To: Christoph Lameter; +Cc: Andi Kleen, David Miller, netdev, linux-kernel

On Mon, Mar 22, 2010 at 09:20:42AM -0500, Christoph Lameter wrote:
> On Fri, 19 Mar 2010, Andi Kleen wrote:
>
> > Christoph Lameter <cl@linux-foundation.org> writes:
> > >
> > > I know about the openpgm implementation. Openpgm does this at the user
> > > level and requires linking to a library. It is essentially a communication
> > > protocol done in user space. It has privilege issues because it has to
> > > create PGM packets via a raw socket.
> >
> > That seems like a poor reason alone to put something into the kernel.
> > Perhaps you rather need some way to have unprivileged raw sockets?
>
> Not the only reason. There are also performance implications. NAKing and
> other control messages from user space are a pain and the available
> implementations add numerous threads just to control the timing of control
> messages and the expiration of data etc. It's difficult to listen to a PGM
> port from user space. You have to get all messages for the PGM protocol
> and then filter in each process.

Ok that sounds like a good reason to have a kernel protocol. Thanks.

Multicast reliable kernel protocols are somewhat new, I guess one
would need to make sure to come up with a clean generic interface
for them first.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.
* Re: Add PGM protocol support to the IP stack
From: Christoph Lameter @ 2010-03-22 16:51 UTC
To: Andi Kleen; +Cc: David Miller, netdev, linux-kernel

On Mon, 22 Mar 2010, Andi Kleen wrote:

> Multicast reliable kernel protocols are somewhat new, I guess one
> would need to make sure to come up with a clean generic interface
> for them first.

It has been around for a long time in another OS. I wonder if I should use
the socket API realized there as a model or come up with something new
from scratch?

What I have right now is:

1. Opening a socket

   A. Native PGM

	fd = socket(AF_INET, SOCK_RDM, IPPROTO_PGM)

   B. PGM over UDP

	fd = socket(AF_INET, SOCK_RDM, IPPROTO_UDP)

   C. PGM over SHM (?)

	fd = socket(AF_UNIX, SOCK_RDM, 0)

2. Binding to a multicast address

   A. Sender

	Connect the socket to a MC address and port using connect().
	Note that the port is significant since multiple streams on
	different ports can be run over the same MC addr.

   B. Receiver

	I.   Bind the socket to the MC address and port of interest.
	II.  Listen to the socket. Process will wait until a PGM packet
	     destined to the port of interest is received.
	III. Accept a connection. Establishes a session. Data can then
	     be received.

3. Sending and receiving

   Use the usual socket read and write operations and the various flavors
   of waiting for a packet via select, poll, epoll etc.

   Packet sizes are determined by the number of packets in a single
   sendmsg() unless overridden by the RM_SET_MESSAGE_BOUNDARY socket option.

   The sender will block when the send window is full unless a non-blocking
   write is performed.

   The receiver shows the usual wait semantics. If the stream is set to
   unreliable then packets may arrive in random order. If the socket is set
   to RM_LISTEN_ONLY then packets may just be missing.

4. Transmitter Socket Options

   A. Setting the window size / rate.

	struct pgm_send_window x;

	x.RateKbitsPerSec = 56;
	x.WindowSizeInMSecs = 60000;
	x.WindowSizeInBytes = 10000000;

	setsockopt(fd, SOCK_RDM, RM_RATE_WINDOW_SIZE, &x, sizeof(x));

	Default is sending at 56Kbps with a buffer of 10 Megabytes and
	buffering for a minute.

   B. FEC mode

	struct pgm_fec_info x;

	x.FECBlockSize = 255;
	x.FECProActivePackets = 0;
	x.FECGroupSize = 0;
	x.fFECOnDemandParityEnabled = 1;

	setsockopt(fd, SOCK_RDM, RM_FEC_MODE, &x, sizeof(x));

5. Receiver Socket Options

   None?

Possible Extensions

   RM_UNORDERED		Accept unordered packets, avoiding delays when
			packets arrive out of sequence. The packet is
			still NAKed.
   RM_RECEIVE_ONLY	Simply ignore missed packets. Do not send any
			replies.

Existing socket options in the other OS
(X denotes that this looks like it's screwy and should be avoided)

/* PGM socket options */

/* Transmitter */
#define RM_LATEJOIN			1	/* X Not supported on receive so why have it? */
#define RM_RATE_WINDOW_SIZE		2	/* See struct pgm_send_window */
#define RM_SEND_WINDOW_ADV_RATE		3	/* X Increase of send window in percentage of window */
#define RM_SENDER_STATISTICS		4	/* see struct pgm_sender_stats */
#define RM_SENDER_WINDOW_ADVANCE_METHOD	5	/* X seems obsolete */
#define RM_SET_MCAST_TTL		6	/* X Can be set via IP_MULTICAST_TTL */
#define RM_SET_MESSAGE_BOUNDARY		7	/* Fix the size of the messages in bytes */
#define RM_SET_SEND_IF			8	/* X use IP_MULTICAST_IF etc instead */
#define RM_USE_FEC			9

/* Receiver */
#define RM_ADD_RECEIVE_IF		100	/* X ???? IP_MULTICAST_IF instead?
						 */
#define RM_DEL_RECEIVE_IF		101	/* X IP_MULTICAST_IF */
#define RM_HIGH_SPEED_INTRANET_OPT	102	/* X PGM should adapt automatically to high speed networks */
#define RM_RECEIVER_STATISTICS		103	/* See struct pgm_receiver_stats */

/* Socket API structures (established by M$DN) */
struct pgm_receiver_stats {
	u64 NumODataPacketsReceived;	/* Number of ODATA (original) sequences */
	u64 NumRDataPacketsReceived;	/* Number of RDATA (repair) sequences */
	u64 NumDuplicateDataPackets;	/* Duplicate sequences */
	u64 DataBytesReceived;
	u64 TotalBytesReceived;
	u64 RateKBitsPerSecOverall;	/* Receive rate since start of session X */
	u64 RateKBitsPerSecLast;	/* Receive rate for last second X */
	u64 TrailingEdgeSeqId;		/* Oldest sequence in the receive window */
	u64 LeadingEdgeSeqId;		/* Newest sequence in the receive window */
	u64 AverageSequencesInWindow;	/* Average number of sequences in receive window X */
	u64 MinSequencesInWindow;	/* The minimum number of sequences */
	u64 MaxSequencesInWindow;	/* The maximum number of sequences */
	u64 FirstNakSequenceNumber;	/* First outstanding NAK sequence number */
	u64 NumPendingNaks;		/* Number of sequences waiting for NCF */
	u64 NumOutstandingNaks;		/* Number of sequences waiting for RDATA */
	u64 NumDataPacketsBuffered;	/* Number of packets currently buffered */
	u64 TotalSelectiveNaksSent;	/* Number of NAKs sent total */
	u64 TotalParityNaksSent;	/* Number of parity NAKs sent */
};

struct pgm_sender_stats {
	u64 DataBytesSent;
	u64 TotalBytesSent;
	u64 NaksReceived;
	u64 NaksReceivedTooLate;	/* NAKs received after receive window advanced */
	u64 NumOutstandingNaks;		/* Number of NAKs awaiting response */
	u64 NumNaksAfterRData;		/* Number of NAKs after RDATA sequences were sent which were ignored */
	u64 RepairPacketsSent;
	u64 BufferSpaceAvailable;	/* Number of partial messages dropped */
	u64 TrailingEdgeSeqId;		/* Oldest sequence id in window */
	u64 LeadingEdgeSeqId;		/* Newest sequence id in window */
	u64 RateKBitsPerSecOverall;	/* Rate since start of
					   session X */
	u64 RateKBitsPerSecLast;	/* Rate in last second X */
	u64 TotalODataPacketsSent;	/* Total data packets transmitted */
};

/* Setup of sender: RateKbitsPerSec = WindowSizeInBytes * 8 / WindowSizeInMSecs */
struct pgm_send_window {
	u64 RateKbitsPerSec;		/* Allowed rate for the sender in kbits per second */
	u64 WindowSizeInMSecs;		/* Send window size in time */
	u64 WindowSizeInBytes;		/* Window size in bytes */
};

struct pgm_fec_info {
	u16 FECBlockSize;		/* Maximum number of packets for a group. Default and max = 255 */
	u16 FECProActivePackets;	/* Number of proactive packets per group. */
	u8 FECGroupSize;		/* Number of packets to be treated as a group. Power of two */
	int fFECOnDemandParityEnabled;	/* Allow sender to send parity repair packets */
};
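The three fields of struct pgm_send_window are tied together: a rate in
kbit/s multiplied by a time in milliseconds gives bits, so dividing by 8
yields the window size in bytes. The following sketch works through the
example values from the proposal (56 kbit/s, 60000 ms, 10000000 bytes). It
assumes 1 kbit = 1000 bits, and the helper names pgm_window_bytes() and
pgm_window_msecs() are made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes needed to buffer window_msecs of traffic at rate_kbits_per_sec.
 * kbit/s * ms = bits, so divide by 8 to get bytes.
 */
uint64_t pgm_window_bytes(uint64_t rate_kbits_per_sec, uint64_t window_msecs)
{
	return rate_kbits_per_sec * window_msecs / 8;
}

/*
 * Milliseconds of traffic that fit into window_bytes when sending at
 * rate_kbits_per_sec. The rate must be non-zero.
 */
uint64_t pgm_window_msecs(uint64_t rate_kbits_per_sec, uint64_t window_bytes)
{
	assert(rate_kbits_per_sec != 0);
	return window_bytes * 8 / rate_kbits_per_sec;
}
```

With these helpers, pgm_window_bytes(56, 60000) is 420000, so one minute at
56 kbit/s needs only about 420 KB. Conversely, pgm_window_msecs(56,
10000000) is 1428571, so a 10 MB buffer at that rate holds roughly 24
minutes of data. The three example values therefore cannot all hold at
once, and an implementation would presumably derive one of them from the
other two.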
* Re: Add PGM protocol support to the IP stack
From: Andi Kleen @ 2010-03-22 17:43 UTC
To: Christoph Lameter; +Cc: David Miller, netdev, linux-kernel

Christoph Lameter <cl@linux-foundation.org> writes:

> On Mon, 22 Mar 2010, Andi Kleen wrote:
>
>> Multicast reliable kernel protocols are somewhat new, I guess one
>> would need to make sure to come up with a clean generic interface
>> for them first.
>
> It has been around for a long time in another OS. I wonder if I should use
> the socket API realized there as a model or come up with something new
> from scratch?

If the other API doesn't have a serious flaw I guess it's better
to aim for a sub/superset at least, to make porting applications easier.

>
> What I have right now is:
>
> 1. Opening a socket
>
> A. Native PGM
>
> fd = socket(AF_INET, SOCK_RDM, IPPROTO_PGM)

RDM = Reliable ? Multicast ?

> B. PGM over UDP
>
> fd = socket(AF_INET, SOCK_RDM, IPPROTO_UDP)
>
> C. PGM over SHM (?)
>
> fd = socket(AF_UNIX, SOCK_RDM, 0)

Not sure how that should work.

> 3. Sending and receiving
>
> Use the usual socket read and write operations and the various flavors of waiting
> for a packet via select, poll, epoll etc.
>
> Packet sizes are determined by the number of packets in a single sendmsg() unless

Number of bytes surely?

> overridden by the RM_SET_MESSAGE_BOUNDARY socket option.

It's unusual to have such an option (except the MTU). What is it good for?

>
> 4. Transmitter Socket Options
>
>
> A. Setting the window size / rate.
>
> struct pgm_send_window x;
> x.RateKbitsPerSec = 56;
> x.WindowSizeInMSecs = 60000;
> x.WindowSizeInBytes = 10000000;
>
> setsockopt(fd, SOCK_RDM, RM_RATE_WINDOW_SIZE, &x, sizeof(x));
>
> Default is sending at 56Kbps with a buffer of 10 Megabytes and buffering for a minute.

That's a very large buffer for a socket.
It would be better to use the usual
auto shrinking/increasing mechanisms.

> B. FEC mode
>
> struct pgm_fec_info x;
>
> x.FECBlockSize = 255;
> x.FECProActivePackets = 0;
> x.FECGroupSize = 0;
> x.fFECOnDemandParityEnabled = 1;
>
> setsockopt(fd, SOCK_RDM, RM_FEC_MODE, &x, sizeof(x));

Is that mode really needed?

> /* Socket API structures (established by M$DN) */
> struct pgm_receiver_stats {
> u64 NumODataPacketsReceived; /* Number of ODATA (original) sequences */

It's difficult to maintain 64 bit counters on 32bit hosts on all targets.
But I guess it would be ok to only fill in 32bit in this case.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.
* Re: Add PGM protocol support to the IP stack
From: Christoph Lameter @ 2010-03-22 18:07 UTC
To: Andi Kleen; +Cc: David Miller, netdev, linux-kernel

On Mon, 22 Mar 2010, Andi Kleen wrote:

> > What I have right now is:
> >
> > 1. Opening a socket
> >
> > A. Native PGM
> >
> > fd = socket(AF_INET, SOCK_RDM, IPPROTO_PGM)
>
> RDM = Reliable ? Multicast ?

RDM is Reliable Datagram Multicast I believe. I'd rather have SOCK_PGM if
I could choose.

>
> > B. PGM over UDP
> >
> > fd = socket(AF_INET, SOCK_RDM, IPPROTO_UDP)
> >
> > C. PGM over SHM (?)
> >
> > fd = socket(AF_UNIX, SOCK_RDM, 0)
>
> Not sure how that should work.

Multiple processes would communicate via shm segments. Maybe defer to the
future but it's an important operation mode as the systems grow bigger and
bigger. The SHM segment would have to contain some sort of ring buffer that
the receivers could tap into. But that mode has not really been thought
through.

> > 3. Sending and receiving
> >
> > Use the usual socket read and write operations and the various flavors of waiting
> > for a packet via select, poll, epoll etc.
> >
> > Packet sizes are determined by the number of packets in a single sendmsg() unless
>
> Number of bytes surely?

Sorry yes you are right.

> > overridden by the RM_SET_MESSAGE_BOUNDARY socket option.
>
> It's unusual to have such an option (except the MTU). What is it good for?

No idea why it was implemented. It can be used to use send() for portions
of a message. Triggers the send() only when all bytes have been provided.
Probably necessary if one wants to have very long (megabytes) messages.
Esoteric and likely not going to be in a first release.

> > 4. Transmitter Socket Options
> >
> >
> > A. Setting the window size / rate.
> >
> > struct pgm_send_window x;
> > x.RateKbitsPerSec = 56;
> > x.WindowSizeInMSecs = 60000;
> > x.WindowSizeInBytes = 10000000;
> >
> > setsockopt(fd, SOCK_RDM, RM_RATE_WINDOW_SIZE, &x, sizeof(x));
> >
> > Default is sending at 56Kbps with a buffer of 10 Megabytes and buffering for a minute.
>
> That's a very large buffer for a socket. It would be better to use the usual
> auto shrinking/increasing mechanisms.

Reliable multicast protocols have a defined time period / "reliability
buffer" so that they can resend a message that was missed for a time
period. It is customary to either specify a time period or define the size
of the "reliability buffer".

> > B. FEC mode
> >
> > struct pgm_fec_info x;
> >
> > x.FECBlockSize = 255;
> > x.FECProActivePackets = 0;
> > x.FECGroupSize = 0;
> > x.fFECOnDemandParityEnabled = 1;
> >
> > setsockopt(fd, SOCK_RDM, RM_FEC_MODE, &x, sizeof(x));
>
> Is that mode really needed?

Never used it. I'd rather skip for now. Maybe later.

>
> > /* Socket API structures (established by M$DN) */
> > struct pgm_receiver_stats {
> > u64 NumODataPacketsReceived; /* Number of ODATA (original) sequences */
>
> It's difficult to maintain 64 bit counters on 32bit hosts on all targets.
> But I guess it would be ok to only fill in 32bit in this case.

32 bit counters have the awful habit of overflowing.
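How quickly a 32 bit counter wraps depends on the event rate. As a rough
worked example (the function name u32_counter_wrap_seconds() is made up
for this illustration), a byte counter on a stream running at 1 Gbit/s
(125000000 bytes/s) wraps after about 34 seconds, while at the proposed
default of 56 kbit/s (7000 bytes/s) it lasts about a week.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Whole seconds until an unsigned 32 bit counter that is incremented
 * events_per_sec times per second wraps around to zero.
 */
uint64_t u32_counter_wrap_seconds(uint64_t events_per_sec)
{
	assert(events_per_sec != 0);
	return (UINT64_C(1) << 32) / events_per_sec;
}
```

u32_counter_wrap_seconds(125000000) evaluates to 34 and
u32_counter_wrap_seconds(7000) to 613566, which is roughly 7.1 days.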
* Re: Add PGM protocol support to the IP stack
From: Andi Kleen @ 2010-03-22 18:53 UTC
To: Christoph Lameter; +Cc: Andi Kleen, David Miller, netdev, linux-kernel

On Mon, Mar 22, 2010 at 01:07:37PM -0500, Christoph Lameter wrote:
> > > B. PGM over UDP
> > >
> > > fd = socket(AF_INET, SOCK_RDM, IPPROTO_UDP)
> > >
> > > C. PGM over SHM (?)
> > >
> > > fd = socket(AF_UNIX, SOCK_RDM, 0)
> >
> > Not sure how that should work.
>
> Multiple processes would communicate via shm segments. Maybe defer to the
> future but it's an important operation mode as the systems grow bigger and bigger.
> SHM segment would have to contain some sort of ring buffer that the
> receivers could tap into. But that mode has not really been thought
> through.

AF_UNIX is not SHM today.

The only point is to avoid one copy? (user1 -> kernel -> user2 to user1 -> user2)
Not sure if that is really worth it. Don't you need another copy to the
reliability buffer anyways?

Letting the kernel parse a data structure in user defined memory is also
always somewhat tricky.

But in principle AF_INET over localhost should not be that much less
efficient than AF_UNIX, so you can probably drop it for now (unless you
need special AF_UNIX features like credentials)

> > >
> > > Packet sizes are determined by the number of packets in a single sendmsg() unless
> >
> > Number of bytes surely?
>
> Sorry yes you are right.
>
> > > overridden by the RM_SET_MESSAGE_BOUNDARY socket option.
> >
> > It's unusual to have such an option (except the MTU). What is it good for?
>
> No idea why it was implemented. It can be used to use send() for portions
> of a message. Triggers the send() only when all bytes have been provided.
> Probably necessary if one wants to have very long (megabytes) messages.

Those could be a problem in kernel memory consumption.
One would need
to be very careful to have a good memory management scheme for the socket
in place.

> > >
> > > A. Setting the window size / rate.
> > >
> > > struct pgm_send_window x;
> > > x.RateKbitsPerSec = 56;
> > > x.WindowSizeInMSecs = 60000;
> > > x.WindowSizeInBytes = 10000000;
> > >
> > > setsockopt(fd, SOCK_RDM, RM_RATE_WINDOW_SIZE, &x, sizeof(x));
> > >
> > > Default is sending at 56Kbps with a buffer of 10 Megabytes and buffering for a minute.
> >
> > That's a very large buffer for a socket. It would be better to use the usual
> > auto shrinking/increasing mechanisms.
>
> Reliable multicast protocols have a defined time period / "reliability
> buffer" so that they can resend a message that was missed for a time
> period. It is customary to either specify a time period or define the size
> of the "reliability buffer".

One problem is memory management then. What happens when a process opens
100 of those sockets and fills them all?

I guess you would still need a suitable global limit like TCP has.

> Never used it. I'd rather skip for now. Maybe later.
>
> >
> > > /* Socket API structures (established by M$DN) */
> > > struct pgm_receiver_stats {
> > > u64 NumODataPacketsReceived; /* Number of ODATA (original) sequences */
> >
> > It's difficult to maintain 64 bit counters on 32bit hosts on all targets.
> > But I guess it would be ok to only fill in 32bit in this case.
>
> 32 bit counters have the awful habit of overflowing.

There's just no portable atomic64_t.

Ok maybe you can use the socket lock to synchronize all the counts if they
are only per socket.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.
* Re: Add PGM protocol support to the IP stack
From: Christoph Lameter @ 2010-03-22 19:32 UTC
To: Andi Kleen; +Cc: David Miller, netdev, linux-kernel

On Mon, 22 Mar 2010, Andi Kleen wrote:

> > Multiple processes would communicate via shm segments. Maybe defer to the
> > future but it's an important operation mode as the systems grow bigger and bigger.
> > SHM segment would have to contain some sort of ring buffer that the
> > receivers could tap into. But that mode has not really been thought
> > through.
>
> AF_UNIX is not SHM today.
>
> The only point is to avoid one copy? (user1 -> kernel -> user2 to user1 -> user2)
> Not sure if that is really worth it. Don't you need another copy to the reliability
> buffer anyways?

Not sure either. Access of multiple processes to one reliability buffer
would be best. Some sort of multiended pipe I guess.

> But in principle AF_INET over localhost should not be that much less efficient
> than AF_UNIX, so you can probably drop it for now (unless you need special AF_UNIX
> features like credentials)

Well let's skip it for now and see if there are performance implications in
the future.

> > > It's unusual to have such an option (except the MTU). What is it good for?
> >
> > No idea why it was implemented. It can be used to use send() for portions
> > of a message. Triggers the send() only when all bytes have been provided.
> > Probably necessary if one wants to have very long (megabytes) messages.
>
> Those could be a problem in kernel memory consumption. One would need
> to be very careful to have a good memory management scheme for the socket
> in place.

Let's not support it then unless someone can make a convincing case.

> > Reliable multicast protocols have a defined time period / "reliability
> > buffer" so that they can resend a message that was missed for a time
> > period. It is customary to either specify a time period or define the size
> > of the "reliability buffer".
>
> One problem is memory management then. What happens when a process opens 100 of those
> sockets and fills them all?

Pushes out the app? Same as the user space apps now. Some sort of upper
limit is needed I guess.

> I guess you would still need a suitable global limit like TCP has.

Yes.
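A global limit of the kind discussed above could be sketched along the
following lines. This is a user-space illustration only, loosely modelled
on the pressure and max thresholds of tcp_mem described in tcp(7). All
names here (struct pgm_mem_state, pgm_mem_charge() and so on) are
hypothetical, and a real kernel implementation would need atomic or locked
accounting, which this sketch leaves out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical global accounting for all PGM sockets, counted in pages. */
struct pgm_mem_state {
	long pressure;	/* above this, sockets should stop growing windows */
	long max;	/* hard limit for all PGM sockets together */
	long allocated;	/* pages currently charged */
};

/* Charge pages to the global pool; refuse if that would exceed max. */
bool pgm_mem_charge(struct pgm_mem_state *st, long pages)
{
	assert(st != NULL && pages >= 0);
	if (st->allocated + pages > st->max)
		return false;
	st->allocated += pages;
	return true;
}

/* Return previously charged pages to the global pool. */
void pgm_mem_uncharge(struct pgm_mem_state *st, long pages)
{
	assert(st != NULL && pages >= 0 && pages <= st->allocated);
	st->allocated -= pages;
}

/* True while the pool is above the pressure threshold. */
bool pgm_mem_under_pressure(const struct pgm_mem_state *st)
{
	assert(st != NULL);
	return st->allocated > st->pressure;
}
```

A socket that wants to keep another packet in its reliability buffer would
call pgm_mem_charge() first and drop (or refuse to send) the packet if the
charge fails, so that 100 sockets filling their windows cannot exhaust
kernel memory.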
* Re: Add PGM protocol support to the IP stack
From: Christoph Lameter @ 2010-03-26 17:33 UTC
To: Andi Kleen; +Cc: David Miller, netdev, linux-kernel

Here is a pgm.7 manpage describing what the socket API could look like
for a PGM implementation. I dumped the RM_* based socket options from the
other OS since most of the options were unusable.


.\" This man page is Copyright (C) 2010 Christoph Lameter <cl@linux-foundation.org>.
.\" Permission is granted to distribute possibly modified copies
.\" of this page provided the header is included verbatim,
.\" and in case of nontrivial modification author and date
.\" of the modification is added to the header.
.\"
.TH PGM 7 2010-08-01 "Linux" "Linux Programmer's Manual"
.SH NAME
pgm \- Pragmatic General Multicast Protocol Support for IPv4
.SH SYNOPSIS
.B #include <sys/socket.h>
.br
.B #include <netinet/in.h>
.br
.B #include <linux/pgm.h>
.sp
.B pgm_socket = socket(AF_INET, SOCK_RDM, IPPROTO_PGM);
.br
.B pgm_socket = socket(AF_INET, SOCK_RDM, IPPROTO_UDP);
.SH DESCRIPTION
This is an implementation of the Pragmatic General Multicast Protocol
described in RFC\ 3208.
PGM implements a connection oriented, Reliable Datagram Messaging
(thus SOCK_RDM) protocol.
Packets are delivered in order even though the network may have reordered,
duplicated or dropped packets.
Receivers may ask for retransmission of missed packets (NAK).
Transmitters do not keep receiver state so that an individual sender is able
to interact with an unlimited number of receivers.

The recovery mechanism of PGM can limit the scalability of PGM if
too many receivers are NAKing.
Therefore measures exist at various layers to reduce the potential repair
volume that a transmitter may have to deal with.

PGM supports two variants.
The first one is the
.B native PGM
protocol which uses its own IP protocol implementation at the same level
as TCP and UDP.
Native PGM supports NAK suppression ("assist") by network elements
(Cisco, Juniper and other commercially available routers have support for PGM)
which is an important measure to reduce the NAK volume in case of
packet loss during multicast replication of messages in the network.
Routers can consolidate multiple NAKs from downstream into a single
upstream NAK and are also able to use
.B FEC
(Forward Error Correction) to directly provide repair data
without having to forward NAKs to a transmitter.

The second variant is
.B PGM over UDP.
UDP is used as a transport protocol instead of IP.
PGM over UDP does
.B not
support assist from network elements and therefore has limited support for
NAK suppression.
PGM over UDP mainly exists because of the lack of kernel based PGM
implementations.
Using raw sockets for packet creation and packet reception is inefficient
and slow.
User space based PGM implementations typically are restricted to a single
stream or multiple streams in the same process since the in-kernel
multiplexing available for TCP and UDP does not exist.
PGM over UDP allows the use of UDP port multiplexing instead, which allows for
efficient operation of multiple streams on a single system even if the OS has
no native support for PGM.

Creation of a PGM socket will lead to an unconnected socket.
A sender must connect to a multicast address to be able to send messages.
A receiver needs to bind to the multicast address and port number of interest
and then listen to the socket.
The receiver can accept a connection when PGM traffic is received on the
chosen PGM multicast address and port.
It is then possible to receive datagrams on the PGM socket.

When
.BR connect (2)
is called on the socket, the multicast destination address is set and
datagrams can then be sent using
.BR send (2)
or
.BR write (2).
It is not possible to send to destinations other than the single multicast
address connected to.
Note that the send operations will cause the application to be throttled
if the maximum transmission rate is exceeded.
Throttling can be avoided by setting the socket to non-blocking mode or
using MSG_DONTWAIT.

In order to receive packets, the socket needs to be bound to a
multicast address first by using
.BR bind (2).

All receive operations return only one packet.
When the packet is smaller than the passed buffer, only that much
data is returned; when it is bigger, the packet is truncated and the
.B MSG_TRUNC
flag is set.
.B MSG_WAITALL
is not supported.

Some IP options may be sent or received using the socket options described in
.BR ip (7).
However, multicast join and leave operations are not supported.
See
.BR ip (7).

By default, Linux PGM does path MTU (Maximum Transmission Unit) discovery.
This means the kernel
will keep track of the MTU to a specific target IP address and return
.B EMSGSIZE
when a PGM packet write exceeds it.
When this happens, the application should decrease the packet size.
Path MTU discovery can also be turned off using the
.B IP_MTU_DISCOVER
socket option or the
.I /proc/sys/net/ipv4/ip_no_pmtu_disc
file; see
.BR ip (7)
for details.
When turned off, PGM will fragment outgoing PGM packets
that exceed the interface MTU.
However, disabling it is not recommended for performance
and reliability reasons.
.SS "Address Format"
PGM supports IPv4 and IPv6 but Linux currently only supports IPv4.
The
.I sockaddr_in
address format described in
.BR ip (7)
is used.
.SS "Error Handling"
All fatal errors will be passed to the user as an error return even
when the socket is not connected.
This includes asynchronous errors
received from the network.
You may get an error for an earlier packet
that was sent on the same socket.
When the
.B IP_RECVERR
option is enabled, all errors are stored in the socket error queue, and can
be received by
.BR recvmsg (2)
with the
.B MSG_ERRQUEUE
flag set.
.SS /proc interfaces
System-wide PGM parameter settings can be accessed by files in the directory
.IR /proc/sys/net/ipv4/ .
.TP
.I pgm_mem
This is a vector of three integers governing the number of pages allowed for
queueing by all PGM sockets.
.RS
.TP 10
.I min
Below this number of pages, PGM is not bothered about its memory appetite.
When the amount of memory allocated by PGM exceeds this number, PGM starts
to moderate memory usage.
.TP
.I pressure
This value was introduced to follow the format of
.I tcp_mem
(see
.BR tcp (7)).
.TP
.I max
Number of pages allowed for queueing by all PGM sockets.
.RE
.IP
Default values for these three items are calculated at boot time from the
amount of available memory.
.TP
.IR pgm_window_size_default " (integer; default value: 10 MB)"
Default size, in bytes, of the receive and transmit windows used by PGM
sockets. Each PGM socket is able to use this size for the receive data
window, even if the total pages of PGM sockets exceed
.I pgm_mem
pressure.
.TP
.IR pgm_window_msec_default " (integer; default value: 2000)"
Default time, in milliseconds, to keep packets in the transmit and receive
windows. Each PGM socket is able to use this time period to resend data,
even if the total pages of PGM sockets exceed
.I pgm_mem
pressure.
.TP
.IR pgm_ambient_spm_msecs " (integer; default value: 15 seconds)"
Interval of the unconditional heartbeat (SPM) sent by PGM transmitters to
periodically notify receivers about the stream status.
.TP
.IR pgm_spm_list_usec " (integers; default value: 1000 1000 4000 8000 16000 32000 64000 1280000 256000 1000000 2000000 8000000)"
Intervals for successive SPM heartbeats for the case that the connection
goes idle. Initial SPMs are rapid to allow for fast discovery of a missed
packet and then back off until the unconditional heartbeat limit is reached.
.TP
.IR pgm_transmitter_rate_kbps " (integer; default value: 56)"
Default limit on the rate of traffic produced by a single transmitter. The
rate is an overall maximum of repair and original data. The limit is set low
because transmitters can do a lot of harm to the network (especially WAN
links) if they send at high rates. It is advisable to be careful when
increasing the rate.
.TP
.IR pgm_transmitter_repair_rate_kpbs " (integer; default value: 30)"
Default limit on the amount of repair data sent by a single transmitter.
.TP
.IR pgm_transmitter_nak_ignore_after_rdata_msec " (integer; default value: 50)"
Period during which to ignore receiver NAKs after repair data was sent
(usually set to correlate to the maximum WAN delay seen). This is used to
avoid useless additional repair data while NAKs and repair data are in
flight.
.TP
.IR pgm_crybaby_rate_kbps " (integer; default value: 20)"
Maximum rate of repair traffic to a single receiver. A single receiver may
be slow and not able to keep up. Therefore it may continually ask for
repairs (thus
.BR crybaby ).
This parameter limits the impact of the continual repair traffic caused by
the crybaby. Typically the limit causes the crybaby to get so far out of
sync that it finally has to give up, since the messages for which repair is
needed have expired on the transmitter side. Note that the transmitters do
not keep track of the receivers. Crybaby detection is an opportunistic
heuristic method.
.TP
.IR pgm_fec_proactive_packets " (integer; default value: 0)"
The number of parity packets to insert in each sequence of
.B pgm_fec_group_size
packets. FEC (Forward Error Correction) is another means to reduce NAK
traffic in configurations with a large number of receivers. Receivers (and
network elements) will be able to reconstruct missed packets on their own
without resorting to NAKs. However, if too many packets are missed and
recovery is not possible, then NAKs will still be sent.
.TP
.IR pgm_fec_group_size " (integer; default value: 16)"
Defines a unit of packets for which FEC parity packets are created.
.TP
.IR pgm_nak_retries " (integer; default value: 20)"
The number of recovery attempts to make for a single message before giving
up.
.TP
.IR pgm_naks_per_sec " (integer; default value: 50)"
The maximum number of NAKs to send per second.
.TP
.IR pgm_debug " (integer; default value: 0)"
Allows enabling diagnostics for PGM interaction on the network. If set to
one, PGM will log all recovery activities. If set to two, PGM will
additionally log SPMs, SPMRs, and connection setup and teardown. If set to
three, PGM will log all activities in the syslog.
.SS "Socket Options"
To set or get a PGM socket option, call
.BR getsockopt (2)
to read or
.BR setsockopt (2)
to write the option with the option level argument set to
.BR IPPROTO_PGM .
.TP
.BR PGM_TRANSMITTER_CONFIG
This option is used to set up parameters for the transmitter before
connecting to a multicast address. The option cannot be used on a connected
SOCK_RDM socket. It is recommended to first get the configuration data
(which will contain the configured OS defaults) and then modify individual
fields as needed.
.sp
.in +4n
.nf
struct pgm_transmitter_config {
    int rate_kbyte;              /* Maximum rate per second */
    int window_msecs;            /* Window maximum packet age */
    int window_kbytes;           /* Window maximum size in kbytes */
    int ambient_spm_msecs;       /* Unconditional SPM */
    int spm_msecs[12];           /* Idle SPM backoff */
    int repeat_nak_ignore_msecs; /* How long to skip NAKs after
                                    sending rdata */
    int repair_rate_kbyte;       /* Max permitted rate of repair traffic */
    int crybaby_rate_kbyte;      /* Max rate of repair traffic to
                                    individual receiver */
    int transmit_only:1;         /* If set do not process feedback
                                    from receivers */
    int fec:1;                   /* Enable forward error correction */
    int fec_parity:1;            /* Respond to parity repair packet
                                    requests */
    int fec_packets_per_group;   /* Maximum number of packets for a group. */
    int fec_proactive_packets;   /* Number of proactive packets per group. */
    int fec_group_size;          /* Number of packets to be treated as a
                                    group. Power of two */
};
.fi
.in
.TP
.BR PGM_TRANSMITTER_STATISTICS
Retrieves transmitter statistics.
.sp
.in +4n
.nf
struct pgm_transmitter_stats {
    u64 bytes_received;
    u64 data_send;
    u64 naks_received;
    u64 naks_too_late;    /* NAKs received after receive window advanced */
    u64 naks_outstanding; /* Number of NAKs awaiting response */
    u64 naks_after_rdata; /* Number of NAKs after RDATA sequences were
                             sent which were ignored */
    u64 rdata_packets;    /* Repair data */
    u64 odata_packets;    /* Original data */
    u32 first_seqid;      /* Oldest sequence id in window */
    u32 last_seqid;       /* Newest sequence id in window */
};
.fi
.in
.TP
.BR PGM_RECEIVER_CONFIG
Used to set up receiver parameters before accepting a connection. The
option cannot be used on a connected SOCK_RDM socket.
.sp
.in +4n
.nf
struct pgm_receiver_config {
    int window_msecs;         /* Receive window maximum age
                                 (per transmitter) */
    int window_kbyte;         /* Receive window maximum size
                                 (per transmitter) */
    int nak_retries;          /* NAK retries before giving up */
    int nak_ncf_retries;      /* NAK retries after NCF before giving up */
    int nak_backoff_interval; /* Time to back off on NAK failure */
    int naks_per_sec;         /* Limit on the NAKs per second */
    int peer_timeout;         /* Discard peer if silent for this
                                 time period */
    int spmr_timeout;         /* Abort connection if no SPMR response */
    int receive_only:1;       /* Never send data to sender */
};
.fi
.in
.TP
.BR PGM_RECEIVER_STATISTICS
Retrieves receiver statistics.
.sp
.in +4n
.nf
struct pgm_receiver_stats {
    u64 bytes_received;      /* Total bytes received */
    u64 data_received;       /* Useful data bytes received */
    u64 odata_packets;       /* Number of ODATA (original) sequences */
    u64 rdata_packets;       /* Number of RDATA (repair) sequences */
    u64 odata_duplicates;    /* Duplicate ODATA */
    u64 rdata_duplicates;    /* Duplicate RDATA */
    u32 first_seqid;         /* First buffered sequence id
                                (first transmitter) */
    u32 last_seqid;          /* Last buffered sequence id
                                (first transmitter) */
    u32 first_naked_seqid;   /* First sequence id that was NAKed */
    u64 pending_naks;        /* Outstanding NAKs */
    u64 pending_ncfs;        /* Outstanding NCFs */
    u64 naks_sent;
    u64 parity_naks_sent;
    u32 active_transmitters; /* Number of transmitters */
};
.fi
.in
.SS Ioctls
These ioctls can be accessed using
.BR ioctl (2).
The correct syntax is:
.PP
.RS
.nf
.BI int " value";
.IB error " = ioctl(" pgm_socket ", " ioctl_type ", &" value ");"
.fi
.RE
.TP
.BR FIONREAD " (" SIOCINQ )
Gets a pointer to an integer as argument. Returns the size of the next
pending datagram in the integer in bytes, or 0 when no datagram is pending.
.TP
.BR TIOCOUTQ " (" SIOCOUTQ )
Returns the number of data bytes in the local send queue.
.PP
In addition all ioctls documented in
.BR ip (7)
and
.BR socket (7)
are supported.
.SH ERRORS
All errors documented for
.BR socket (7)
or
.BR ip (7)
may be returned by a send or receive on a PGM socket.
.TP
.B ECONNREFUSED
The socket was not associated with a multicast address. For a receiver this
may mean that no PGM traffic was detected on the given port. The address
specified may not be a valid multicast address.
.TP
.B ENOTCONN
Socket is not connected.
.TP
.B EISCONN
Socket is already connected.
.TP
.B ECONNABORTED
Receiver was not able to keep up. Connection was torn down.
.\" .SH CREDITS
.\" This man page was written by Christoph Lameter.
.SH "SEE ALSO"
.BR ip (7),
.BR raw (7),
.BR socket (7),
.BR udp (7)
.PP
RFC\ 3208 for the Pragmatic General Multicast protocol.
.br
RFC\ 1122 for the host requirements.
.br
RFC\ 1191 for a description of path MTU discovery.
.SH COLOPHON
This page is part of release 3.xx of the Linux
.I man-pages
project. A description of the project, and information about reporting bugs,
can be found at http://www.kernel.org/doc/man-pages/.
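To make the sender-side steps in the draft man page concrete (create the socket, read the default transmitter configuration, adjust it, connect to the multicast group), here is a minimal sketch in C. It is only an illustration under loud assumptions: no mainline kernel implements this API, the option number PGM_TRANSMITTER_CONFIG and the helper name pgm_open_sender() are made up for the example, and the structure is simply copied from the draft. On a kernel without PGM support the socket() call is expected to fail and the helper returns -1.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* ASSUMPTION: IPPROTO_PGM is taken to be the IANA protocol number for PGM. */
#ifndef IPPROTO_PGM
#define IPPROTO_PGM 113
#endif

/* ASSUMPTION: hypothetical option number, not defined by any real kernel. */
#define PGM_TRANSMITTER_CONFIG 1

/* Copied from the draft man page; not an existing kernel structure. */
struct pgm_transmitter_config {
    int rate_kbyte;
    int window_msecs;
    int window_kbytes;
    int ambient_spm_msecs;
    int spm_msecs[12];
    int repeat_nak_ignore_msecs;
    int repair_rate_kbyte;
    int crybaby_rate_kbyte;
    int transmit_only:1;
    int fec:1;
    int fec_parity:1;
    int fec_packets_per_group;
    int fec_proactive_packets;
    int fec_group_size;
};

/*
 * Hypothetical helper: open a PGM sender for the given multicast group.
 * Returns a connected socket descriptor, or -1 with errno set.
 */
int pgm_open_sender(const char *group, unsigned short port, int rate_kbyte)
{
    struct pgm_transmitter_config cfg;
    socklen_t len = sizeof(cfg);
    struct sockaddr_in dst;
    int fd, saved;

    assert(group != NULL);

    memset(&dst, 0, sizeof(dst));
    dst.sin_family = AF_INET;
    dst.sin_port = htons(port);
    if (inet_pton(AF_INET, group, &dst.sin_addr) != 1) {
        errno = EINVAL;
        return -1;
    }

    /* Native PGM socket as proposed in the draft. */
    fd = socket(AF_INET, SOCK_RDM, IPPROTO_PGM);
    if (fd < 0)
        return -1;      /* expected on kernels without PGM support */

    /* Read the OS defaults first, then change only what is needed. */
    if (getsockopt(fd, IPPROTO_PGM, PGM_TRANSMITTER_CONFIG, &cfg, &len) < 0)
        goto fail;
    cfg.rate_kbyte = rate_kbyte;
    if (setsockopt(fd, IPPROTO_PGM, PGM_TRANSMITTER_CONFIG,
                   &cfg, sizeof(cfg)) < 0)
        goto fail;

    /* The configuration must be set before connect(), per the draft. */
    if (connect(fd, (struct sockaddr *)&dst, sizeof(dst)) < 0)
        goto fail;
    return fd;

fail:
    saved = errno;      /* keep the original error across close() */
    close(fd);
    errno = saved;
    return -1;
}
```

A caller would then use send(2) or write(2) on the returned descriptor, for example after pgm_open_sender("239.1.1.1", 7500, 56); the group address and port here are arbitrary illustration values.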
* Re: Add PGM protocol support to the IP stack
From: Andi Kleen @ 2010-03-27 13:11 UTC
To: Christoph Lameter; +Cc: Andi Kleen, David Miller, netdev, linux-kernel

On Fri, Mar 26, 2010 at 12:33:07PM -0500, Christoph Lameter wrote:
> Here is a pgm.7 manpage describing how the socket API could look like for
> a PGM implementation.
>
> I dumped the RM_* based socket options from the other OS since most of the
> options were unusable.

I did a quick read and the manpage/interface seem reasonable to me.

You changed the parameter struct fields to lower case. While
that looks definitely more Linuxy than before does it mean programs
have to #ifdef this? It might be good idea to have at least some
optional compat header that #defines.

-Andi
* Re: Add PGM protocol support to the IP stack
From: Martin Sustrik @ 2010-03-27 16:54 UTC
To: Andi Kleen; +Cc: Christoph Lameter, David Miller, netdev, linux-kernel

Andi Kleen wrote:

> I did a quick read and the manpage/interface seem reasonable to me.

You may also have a look at original PGM implementation by Luigi Rizzo
(FreeBSD). It's not maintained, but it might give you broader view.

http://info.iet.unipi.it/~luigi/pgm-code/

Martin
* Re: Add PGM protocol support to the IP stack
From: Christoph Lameter @ 2010-03-29 14:50 UTC
To: Martin Sustrik; +Cc: Andi Kleen, David Miller, netdev, linux-kernel

On Sat, 27 Mar 2010, Martin Sustrik wrote:

> Andi Kleen wrote:
>
> > I did a quick read and the manpage/interface seem reasonable to me.
>
> You may also have a look at original PGM implementation by Luigi Rizzo
> (FreeBSD). It's not maintained, but it might give you broader view.
>
> http://info.iet.unipi.it/~luigi/pgm-code/

Interesting. Which files in that directory contain the most current code?
Looks like the tcpdump patch has been merged.

Here is another tcpdump patch that implements decoding PGM via UDP. Anyone
know how to submit something like that?

(Need to specify -Tpgm option to use pgm decoder on UDP traffic)

Index: tcpdump/interface.h
===================================================================
--- tcpdump.orig/interface.h 2010-02-26 18:50:39.411609391 -0600
+++ tcpdump/interface.h 2010-02-26 18:51:04.270350179 -0600
@@ -74,6 +74,7 @@
 #define PT_CNFP 7 /* Cisco NetFlow protocol */
 #define PT_TFTP 8 /* trivial file transfer protocol */
 #define PT_AODV 9 /* Ad-hoc On-demand Distance Vector Protocol */
+#define PT_PGM 10 /* The PGM protocol */
 
 #ifndef min
 #define min(a,b) ((a)>(b)?(b):(a))
Index: tcpdump/print-udp.c
===================================================================
--- tcpdump.orig/print-udp.c 2010-02-26 18:51:35.921610552 -0600
+++ tcpdump/print-udp.c 2010-02-26 18:53:54.440349950 -0600
@@ -520,6 +520,11 @@
 			tftp_print(cp, length);
 			break;
 
+		case PT_PGM:
+			udpipaddr_print(ip, sport, dport);
+			pgm_print(cp, length, (const u_char *)ip);
+			break;
+
 		case PT_AODV:
 			udpipaddr_print(ip, sport, dport);
 			aodv_print((const u_char *)(up + 1), length,
Index: tcpdump/tcpdump.c
===================================================================
--- tcpdump.orig/tcpdump.c 2010-02-26 18:37:13.971601597 -0600
+++ tcpdump/tcpdump.c 2010-02-26 18:37:43.290033748 -0600
@@ -854,6 +854,8 @@
 				packettype = PT_TFTP;
 			else if (strcasecmp(optarg, "aodv") == 0)
 				packettype = PT_AODV;
+			else if (strcasecmp(optarg, "pgm") == 0)
+				packettype = PT_PGM;
 			else
 				error("unknown packet type `%s'", optarg);
 			break;
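For readers wondering what a decoder has to do once a UDP payload is treated as PGM, as the patch above arranges with -Tpgm, the following is a rough sketch in C of parsing the fixed 16-byte PGM header. It is not tcpdump's pgm_print(); the struct and function names are invented for this example, and the field layout and type codes follow my reading of RFC 3208 and should be checked against the RFC before being relied on.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PGM_HDR_LEN 16  /* fixed PGM header size, per my reading of RFC 3208 */

/* Hypothetical host-order view of the fixed PGM header. */
struct pgm_hdr_sketch {
    uint16_t sport;     /* source port */
    uint16_t dport;     /* destination port */
    uint8_t  type;      /* packet type */
    uint8_t  options;   /* option flags */
    uint16_t checksum;
    uint8_t  gsi[6];    /* global source ID */
    uint16_t tsdu_len;  /* TSDU length */
};

/* Read a 16-bit big-endian (network order) value. */
static uint16_t rd16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/* Parse the fixed header; returns 0 on success, -1 if the buffer is short. */
int pgm_parse_header(const uint8_t *buf, size_t len, struct pgm_hdr_sketch *h)
{
    if (buf == NULL || h == NULL || len < PGM_HDR_LEN)
        return -1;
    h->sport    = rd16(buf);
    h->dport    = rd16(buf + 2);
    h->type     = buf[4];
    h->options  = buf[5];
    h->checksum = rd16(buf + 6);
    memcpy(h->gsi, buf + 8, sizeof(h->gsi));
    h->tsdu_len = rd16(buf + 14);
    return 0;
}

/* Map a type code to a name; codes not listed here are reported as unknown. */
const char *pgm_type_name(uint8_t type)
{
    switch (type) {
    case 0x00: return "SPM";
    case 0x04: return "ODATA";
    case 0x05: return "RDATA";
    case 0x08: return "NAK";
    case 0x09: return "NNAK";
    case 0x0a: return "NCF";
    case 0x0c: return "SPMR";
    default:   return "unknown";
    }
}
```

With PGM over UDP the buffer passed in would be the UDP payload; with native PGM it would be the IP payload. The sketch does not verify the checksum or walk option extensions.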
* Re: Add PGM protocol support to the IP stack
From: Christoph Lameter @ 2010-03-29 15:00 UTC
To: Andi Kleen; +Cc: David Miller, netdev, linux-kernel

On Sat, 27 Mar 2010, Andi Kleen wrote:

> On Fri, Mar 26, 2010 at 12:33:07PM -0500, Christoph Lameter wrote:
> > Here is a pgm.7 manpage describing how the socket API could look like for
> > a PGM implementation.
> >
> > I dumped the RM_* based socket options from the other OS since most of the
> > options were unusable.
>
> I did a quick read and the manpage/interface seem reasonable to me.

Thanks. I will then proceed to get a patch out that implements the
network environment. Then we can plug the openpgm logic in there.

> You changed the parameter struct fields to lower case. While
> that looks definitely more Linuxy than before does it mean programs
> have to #ifdef this? It might be good idea to have at least some
> optional compat header that #defines.

The socket API will be completely different. The basic handling of the
sockets is the same (binding, listening, connecting). There is no way of
mapping M$ socket options to Linux socket options with the approach that
I proposed in the manpage. The stats structure is different too since some
key elements were missing.

What users are there of the M$ api? I have seen vendors supplying their
own pgm implementation (guess due to bit rot in the old M$
implementation).
* Re: Add PGM protocol support to the IP stack
From: Andi Kleen @ 2010-03-29 21:43 UTC
To: Christoph Lameter; +Cc: Andi Kleen, David Miller, netdev, linux-kernel

On Mon, Mar 29, 2010 at 10:00:57AM -0500, Christoph Lameter wrote:
> On Sat, 27 Mar 2010, Andi Kleen wrote:
>
> > On Fri, Mar 26, 2010 at 12:33:07PM -0500, Christoph Lameter wrote:
> > > Here is a pgm.7 manpage describing how the socket API could look like for
> > > a PGM implementation.
> > >
> > > I dumped the RM_* based socket options from the other OS since most of the
> > > options were unusable.
> >
> > I did a quick read and the manpage/interface seem reasonable to me.
>
> Thanks. I will then proceed to get a patch out that implements the
> network environment. Then we can plug the openpgm logic in there.

You might still need some reviewing from network maintainers.

>
> > You changed the parameter struct fields to lower case. While
> > that looks definitely more Linuxy than before does it mean programs
> > have to #ifdef this? It might be good idea to have at least some
> > optional compat header that #defines.
>
> The socket API will be completely different. The basic handling of the
> sockets is the same (binding, listening, connecting). There is no way of
> mapping M$ socket options to Linux socket options with the approach that
> I proposed in the manpage. The stats structure is different too since some
> key elements were missing.

Ok.

>
> What users are there of the M$ api? I have seen vendors supplying their
> own pgm implementation (guess due to bit rot in the old M$
> implementation).

I don't know, it was just a general consideration.

-Andi
--
ak@linux.intel.com -- Speaking for myself only.
* Re: Add PGM protocol support to the IP stack
From: H. Peter Anvin @ 2010-03-29 23:01 UTC
To: Andi Kleen; +Cc: Christoph Lameter, David Miller, netdev, linux-kernel

On 03/22/2010 11:53 AM, Andi Kleen wrote:
>
> There's just no portable atomic64_t. Ok maybe you can use the socket lock
> to synchronize all the counts if they are only per socket.
>

In 2.6.34 there is (although some arches which could support it natively
don't as of yet... but that's fixable.) See lib/atomic64.c.

-hpa
* Re: Add PGM protocol support to the IP stack
From: Christoph Lameter @ 2010-03-30 18:12 UTC
To: H. Peter Anvin; +Cc: Andi Kleen, David Miller, netdev, linux-kernel

On Mon, 29 Mar 2010, H. Peter Anvin wrote:

> On 03/22/2010 11:53 AM, Andi Kleen wrote:
> >
> > There's just no portable atomic64_t. Ok maybe you can use the socket lock
> > to synchronize all the counts if they are only per socket.
> >
>
> In 2.6.34 there is (although some arches which could support it natively
> don't as of yet... but that's fixable.) See lib/atomic64.c.

There are also the 64bit thiscpu operations that were merged in 2.6.33.
They do the right thing if the arch does not provide operations.