Merge master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6

+145

Documentation/networking/fib_trie.txt

···
+			LC-trie implementation notes.
+
+Node types
+----------
+leaf
+	An end node with data. This has a copy of the relevant key, along
+	with 'hlist' with routing table entries sorted by prefix length.
+	See struct leaf and struct leaf_info.
+
+trie node or tnode
+	An internal node, holding an array of child (leaf or tnode) pointers,
+	indexed through a subset of the key. See Level Compression.
+
+A few concepts explained
+------------------------
+Bits (tnode)
+	The number of bits in the key segment used for indexing into the
+	child array - the "child index". See Level Compression.
+
+Pos (tnode)
+	The position (in the key) of the key segment used for indexing into
+	the child array. See Path Compression.
+
+Path Compression / skipped bits
+	Any given tnode is linked to from the child array of its parent, using
+	a segment of the key specified by the parent's "pos" and "bits".
+	In certain cases, this tnode's own "pos" will not be immediately
+	adjacent to the parent (pos+bits), but there will be some bits
+	in the key skipped over because they represent a single path with no
+	deviations. These "skipped bits" constitute Path Compression.
+	Note that the search algorithm will simply skip over these bits when
+	searching, making it necessary to save the keys in the leaves to
+	verify that they actually do match the key we are searching for.
+
+Level Compression / child arrays
+	The trie is kept level balanced by moving, under certain conditions,
+	the children of a full child (see "full_children") up one level, so that
+	instead of a pure binary tree, each internal node ("tnode") may
+	contain an arbitrarily large array of links to several children.
+	Conversely, a tnode with a mostly empty child array (see empty_children)
+	may be "halved", having some of its children moved downwards one level,
+	in order to avoid ever-increasing child arrays.
+
+empty_children
+	The number of positions in the child array of a given tnode that are
+	NULL.
+
+full_children
+	The number of children of a given tnode that aren't path compressed.
+	(In other words, they aren't NULL or leaves and their "pos" is equal
+	to this tnode's "pos"+"bits").
+
+	(The word "full" here is used more in the sense of "complete" than
+	as the opposite of "empty", which might be a tad confusing.)
+
+Comments
+---------
+
+We have tried to keep the structure of the code as close to fib_hash as
+possible to allow verification and help with reviewing.
+
+fib_find_node()
+	A good start for understanding this code. This function implements a
+	straightforward trie lookup.
+
+fib_insert_node()
+	Inserts a new leaf node in the trie. This is a bit more complicated than
+	fib_find_node(). Inserting a new node means we might have to run the
+	level compression algorithm on part of the trie.
+
+trie_leaf_remove()
+	Looks up a key, deletes it and runs the level compression algorithm.
+
+trie_rebalance()
+	The key function for the dynamic trie. After any change in the trie,
+	it is run to optimize and reorganize. It will walk the trie upwards
+	towards the root from a given tnode, doing a resize() at each step
+	to implement level compression.
+
+resize()
+	Analyzes a tnode and optimizes the child array size by either inflating
+	or shrinking it repeatedly until it fulfills the criteria for optimal
+	level compression. This part follows the original paper pretty closely
+	and there may be some room for experimentation here.
+
+inflate()
+	Doubles the size of the child array within a tnode. Used by resize().
+
+halve()
+	Halves the size of the child array within a tnode - the inverse of
+	inflate(). Used by resize().
+
+fn_trie_insert(), fn_trie_delete(), fn_trie_select_default()
+	The route manipulation functions. Should conform pretty closely to the
+	corresponding functions in fib_hash.
+
+fn_trie_flush()
+	This walks the full trie (using nextleaf()) and searches for empty
+	leaves which have to be removed.
+
+fn_trie_dump()
+	Dumps the routing table ordered by prefix length. This is somewhat
+	slower than the corresponding fib_hash function, as we have to walk the
+	entire trie for each prefix length. In comparison, fib_hash is organized
+	as one "zone"/hash per prefix length.
+
+Locking
+-------
+
+fib_lock is used as an RW-lock in the same way that this is done in fib_hash.
+However, the functions are somewhat separated for other possible locking
+scenarios. It might conceivably be possible to run trie_rebalance via RCU
+to avoid read_lock in the fn_trie_lookup() function.
+
+Main lookup mechanism
+---------------------
+fn_trie_lookup() is the main lookup function.
+
+The lookup is in its simplest form just like fib_find_node(). We descend the
+trie, key segment by key segment, until we find a leaf. check_leaf() does
+the fib_semantic_match in the leaf's sorted prefix hlist.
+
+If we find a match, we are done.
+
+If we don't find a match, we enter prefix matching mode. The prefix length,
+starting out at the same as the key length, is reduced one step at a time,
+and we backtrack upwards through the trie trying to find a longest matching
+prefix. The goal is always to reach a leaf and get a positive result from the
+fib_semantic_match mechanism.
+
+Inside each tnode, the search for longest matching prefix consists of searching
+through the child array, chopping off (zeroing) the least significant "1" of
+the child index until we find a match or the child index consists of nothing but
+zeros.
+
+At this point we backtrack (t->stats.backtrack++) up the trie, continuing to
+chop off part of the key in order to find the longest matching prefix.
+
+At this point we will repeatedly descend subtries to look for a match, and there
+are some optimizations available that can provide us with "shortcuts" to avoid
+descending into dead ends. Look for "HL_OPTIMIZE" sections in the code.
+
+To alleviate any doubts about the correctness of the route selection process,
+a new netlink operation has been added. Look for NETLINK_FIB_LOOKUP, which
+gives userland access to fib_lookup().
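
Note on the fib_trie.txt text above: the "pos"/"bits" child index and the "chop off the least significant 1" step can be sketched in plain C. This is an illustration only, not the code in fib_trie.c. The names child_index() and chop_lowest_one() are hypothetical, and the sketch assumes a 32-bit key whose bit positions are counted from the most significant bit.

```c
#include <assert.h>
#include <stdint.h>

#define KEYLENGTH 32	/* assumed key width (IPv4 address) */

/*
 * Extract the "bits"-wide key segment that starts "pos" bits from the
 * most significant end of the key. The result indexes the child array.
 */
static uint32_t child_index(uint32_t key, unsigned int pos, unsigned int bits)
{
	assert(pos + bits <= KEYLENGTH);
	if (bits == 0)
		return 0;	/* also avoids an undefined 32-bit shift */
	return (uint32_t)(key << pos) >> (KEYLENGTH - bits);
}

/* Zero the least significant set bit of a child index. */
static uint32_t chop_lowest_one(uint32_t idx)
{
	return idx & (idx - 1);
}
```

With the key 0xC0A80100 (192.168.1.0), pos 8 and bits 8 select the second octet, 0xA8. Repeated calls to chop_lowest_one() walk an index such as binary 1011 through 1010 and 1000 down to zero, which is the order of candidates the prefix-matching description implies.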

+16 -26

drivers/net/shaper.c

···
 {
 	struct shaper *shaper = dev->priv;
 	struct sk_buff *ptr;
-
-	if (down_trylock(&shaper->sem))
-		return -1;
-
+
+	spin_lock(&shaper->lock);
 	ptr=shaper->sendq.prev;

 	/*
···
 		shaper->stats.collisions++;
 	}
 	shaper_kick(shaper);
-	up(&shaper->sem);
+	spin_unlock(&shaper->lock);
 	return 0;
 }

···
 {
 	struct shaper *shaper = (struct shaper *)data;

-	if (!down_trylock(&shaper->sem)) {
-		shaper_kick(shaper);
-		up(&shaper->sem);
-	} else
-		mod_timer(&shaper->timer, jiffies);
+	spin_lock(&shaper->lock);
+	shaper_kick(shaper);
+	spin_unlock(&shaper->lock);
 }

 /*
···


 /*
- *	Flush the shaper queues on a closedown
- */
-
-static void shaper_flush(struct shaper *shaper)
-{
-	struct sk_buff *skb;
-
-	down(&shaper->sem);
-	while((skb=skb_dequeue(&shaper->sendq))!=NULL)
-		dev_kfree_skb(skb);
-	shaper_kick(shaper);
-	up(&shaper->sem);
-}
-
-/*
  *	Bring the interface up. We just disallow this until a
  *	bind.
  */
···
 static int shaper_close(struct net_device *dev)
 {
 	struct shaper *shaper=dev->priv;
-	shaper_flush(shaper);
+	struct sk_buff *skb;
+
+	while ((skb = skb_dequeue(&shaper->sendq)) != NULL)
+		dev_kfree_skb(skb);
+
+	spin_lock_bh(&shaper->lock);
+	shaper_kick(shaper);
+	spin_unlock_bh(&shaper->lock);
+
 	del_timer_sync(&shaper->timer);
 	return 0;
 }
···
 	init_timer(&sh->timer);
 	sh->timer.function=shaper_timer;
 	sh->timer.data=(unsigned long)sh;
+	spin_lock_init(&sh->lock);
 }

 /*

+1

drivers/net/skge.h

···
 /* PCI config registers */
 #define PCI_DEV_REG1	0x40
 #define PCI_DEV_REG2	0x44
+#define PCI_REV_DESC	0x4

 #define PCI_STATUS_ERROR_BITS (PCI_STATUS_DETECTED_PARITY | \
 			       PCI_STATUS_SIG_SYSTEM_ERROR | \

+65 -4

drivers/net/tg3.c

··· 66 66 67 67 #define DRV_MODULE_NAME "tg3" 68 68 #define PFX DRV_MODULE_NAME ": " 69 - #define DRV_MODULE_VERSION "3.32" 70 - #define DRV_MODULE_RELDATE "June 24, 2005" 69 + #define DRV_MODULE_VERSION "3.33" 70 + #define DRV_MODULE_RELDATE "July 5, 2005" 71 71 72 72 #define TG3_DEF_MAC_MODE 0 73 73 #define TG3_DEF_RX_MODE 0 ··· 5117 5117 } 5118 5118 5119 5119 static void __tg3_set_rx_mode(struct net_device *); 5120 - static void tg3_set_coalesce(struct tg3 *tp, struct ethtool_coalesce *ec) 5120 + static void __tg3_set_coalesce(struct tg3 *tp, struct ethtool_coalesce *ec) 5121 5121 { 5122 5122 tw32(HOSTCC_RXCOL_TICKS, ec->rx_coalesce_usecs); 5123 5123 tw32(HOSTCC_TXCOL_TICKS, ec->tx_coalesce_usecs); ··· 5460 5460 udelay(10); 5461 5461 } 5462 5462 5463 - tg3_set_coalesce(tp, &tp->coal); 5463 + __tg3_set_coalesce(tp, &tp->coal); 5464 5464 5465 5465 /* set status block DMA address */ 5466 5466 tw32(HOSTCC_STATUS_BLK_HOST_ADDR + TG3_64BIT_REG_HIGH, ··· 7821 7821 return 0; 7822 7822 } 7823 7823 7824 + static int tg3_set_coalesce(struct net_device *dev, struct ethtool_coalesce *ec) 7825 + { 7826 + struct tg3 *tp = netdev_priv(dev); 7827 + u32 max_rxcoal_tick_int = 0, max_txcoal_tick_int = 0; 7828 + u32 max_stat_coal_ticks = 0, min_stat_coal_ticks = 0; 7829 + 7830 + if (!(tp->tg3_flags2 & TG3_FLG2_5705_PLUS)) { 7831 + max_rxcoal_tick_int = MAX_RXCOAL_TICK_INT; 7832 + max_txcoal_tick_int = MAX_TXCOAL_TICK_INT; 7833 + max_stat_coal_ticks = MAX_STAT_COAL_TICKS; 7834 + min_stat_coal_ticks = MIN_STAT_COAL_TICKS; 7835 + } 7836 + 7837 + if ((ec->rx_coalesce_usecs > MAX_RXCOL_TICKS) || 7838 + (ec->tx_coalesce_usecs > MAX_TXCOL_TICKS) || 7839 + (ec->rx_max_coalesced_frames > MAX_RXMAX_FRAMES) || 7840 + (ec->tx_max_coalesced_frames > MAX_TXMAX_FRAMES) || 7841 + (ec->rx_coalesce_usecs_irq > max_rxcoal_tick_int) || 7842 + (ec->tx_coalesce_usecs_irq > max_txcoal_tick_int) || 7843 + (ec->rx_max_coalesced_frames_irq > MAX_RXCOAL_MAXF_INT) || 7844 + (ec->tx_max_coalesced_frames_irq > 
MAX_TXCOAL_MAXF_INT) || 7845 + (ec->stats_block_coalesce_usecs > max_stat_coal_ticks) || 7846 + (ec->stats_block_coalesce_usecs < min_stat_coal_ticks)) 7847 + return -EINVAL; 7848 + 7849 + /* No rx interrupts will be generated if both are zero */ 7850 + if ((ec->rx_coalesce_usecs == 0) && 7851 + (ec->rx_max_coalesced_frames == 0)) 7852 + return -EINVAL; 7853 + 7854 + /* No tx interrupts will be generated if both are zero */ 7855 + if ((ec->tx_coalesce_usecs == 0) && 7856 + (ec->tx_max_coalesced_frames == 0)) 7857 + return -EINVAL; 7858 + 7859 + /* Only copy relevant parameters, ignore all others. */ 7860 + tp->coal.rx_coalesce_usecs = ec->rx_coalesce_usecs; 7861 + tp->coal.tx_coalesce_usecs = ec->tx_coalesce_usecs; 7862 + tp->coal.rx_max_coalesced_frames = ec->rx_max_coalesced_frames; 7863 + tp->coal.tx_max_coalesced_frames = ec->tx_max_coalesced_frames; 7864 + tp->coal.rx_coalesce_usecs_irq = ec->rx_coalesce_usecs_irq; 7865 + tp->coal.tx_coalesce_usecs_irq = ec->tx_coalesce_usecs_irq; 7866 + tp->coal.rx_max_coalesced_frames_irq = ec->rx_max_coalesced_frames_irq; 7867 + tp->coal.tx_max_coalesced_frames_irq = ec->tx_max_coalesced_frames_irq; 7868 + tp->coal.stats_block_coalesce_usecs = ec->stats_block_coalesce_usecs; 7869 + 7870 + if (netif_running(dev)) { 7871 + tg3_full_lock(tp, 0); 7872 + __tg3_set_coalesce(tp, &tp->coal); 7873 + tg3_full_unlock(tp); 7874 + } 7875 + return 0; 7876 + } 7877 + 7824 7878 static struct ethtool_ops tg3_ethtool_ops = { 7825 7879 .get_settings = tg3_get_settings, 7826 7880 .set_settings = tg3_set_settings, ··· 7910 7856 .get_stats_count = tg3_get_stats_count, 7911 7857 .get_ethtool_stats = tg3_get_ethtool_stats, 7912 7858 .get_coalesce = tg3_get_coalesce, 7859 + .set_coalesce = tg3_set_coalesce, 7913 7860 }; 7914 7861 7915 7862 static void __devinit tg3_get_eeprom_size(struct tg3 *tp) ··· 9854 9799 ec->rx_coalesce_usecs_irq = DEFAULT_RXCOAL_TICK_INT_CLRTCKS; 9855 9800 ec->tx_coalesce_usecs = LOW_TXCOL_TICKS_CLRTCKS; 9856 9801 
ec->tx_coalesce_usecs_irq = DEFAULT_TXCOAL_TICK_INT_CLRTCKS; 9802 + } 9803 + 9804 + if (tp->tg3_flags2 & TG3_FLG2_5705_PLUS) { 9805 + ec->rx_coalesce_usecs_irq = 0; 9806 + ec->tx_coalesce_usecs_irq = 0; 9807 + ec->stats_block_coalesce_usecs = 0; 9857 9808 } 9858 9809 } 9859 9810

+10

drivers/net/tg3.h

··· 879 879 #define LOW_RXCOL_TICKS_CLRTCKS 0x00000014 880 880 #define DEFAULT_RXCOL_TICKS 0x00000048 881 881 #define HIGH_RXCOL_TICKS 0x00000096 882 + #define MAX_RXCOL_TICKS 0x000003ff 882 883 #define HOSTCC_TXCOL_TICKS 0x00003c0c 883 884 #define LOW_TXCOL_TICKS 0x00000096 884 885 #define LOW_TXCOL_TICKS_CLRTCKS 0x00000048 885 886 #define DEFAULT_TXCOL_TICKS 0x0000012c 886 887 #define HIGH_TXCOL_TICKS 0x00000145 888 + #define MAX_TXCOL_TICKS 0x000003ff 887 889 #define HOSTCC_RXMAX_FRAMES 0x00003c10 888 890 #define LOW_RXMAX_FRAMES 0x00000005 889 891 #define DEFAULT_RXMAX_FRAMES 0x00000008 890 892 #define HIGH_RXMAX_FRAMES 0x00000012 893 + #define MAX_RXMAX_FRAMES 0x000000ff 891 894 #define HOSTCC_TXMAX_FRAMES 0x00003c14 892 895 #define LOW_TXMAX_FRAMES 0x00000035 893 896 #define DEFAULT_TXMAX_FRAMES 0x0000004b 894 897 #define HIGH_TXMAX_FRAMES 0x00000052 898 + #define MAX_TXMAX_FRAMES 0x000000ff 895 899 #define HOSTCC_RXCOAL_TICK_INT 0x00003c18 896 900 #define DEFAULT_RXCOAL_TICK_INT 0x00000019 897 901 #define DEFAULT_RXCOAL_TICK_INT_CLRTCKS 0x00000014 902 + #define MAX_RXCOAL_TICK_INT 0x000003ff 898 903 #define HOSTCC_TXCOAL_TICK_INT 0x00003c1c 899 904 #define DEFAULT_TXCOAL_TICK_INT 0x00000019 900 905 #define DEFAULT_TXCOAL_TICK_INT_CLRTCKS 0x00000014 906 + #define MAX_TXCOAL_TICK_INT 0x000003ff 901 907 #define HOSTCC_RXCOAL_MAXF_INT 0x00003c20 902 908 #define DEFAULT_RXCOAL_MAXF_INT 0x00000005 909 + #define MAX_RXCOAL_MAXF_INT 0x000000ff 903 910 #define HOSTCC_TXCOAL_MAXF_INT 0x00003c24 904 911 #define DEFAULT_TXCOAL_MAXF_INT 0x00000005 912 + #define MAX_TXCOAL_MAXF_INT 0x000000ff 905 913 #define HOSTCC_STAT_COAL_TICKS 0x00003c28 906 914 #define DEFAULT_STAT_COAL_TICKS 0x000f4240 915 + #define MAX_STAT_COAL_TICKS 0xd693d400 916 + #define MIN_STAT_COAL_TICKS 0x00000064 907 917 /* 0x3c2c --> 0x3c30 unused */ 908 918 #define HOSTCC_STATS_BLK_HOST_ADDR 0x00003c30 /* 64-bit */ 909 919 #define HOSTCC_STATUS_BLK_HOST_ADDR 0x00003c38 /* 64-bit */

+1 -1

include/linux/if_shaper.h

···
 	__u32 shapeclock;
 	unsigned long recovery;	/* Time we can next clock a packet out on
 				   an empty queue */
-	struct semaphore sem;
+	spinlock_t lock;
 	struct net_device_stats stats;
 	struct net_device *dev;
 	int (*hard_start_xmit) (struct sk_buff *skb,

+9 -10

include/linux/skbuff.h

···
  *	@priority: Packet queueing priority
  *	@users: User count - see {datagram,tcp}.c
  *	@protocol: Packet protocol from driver
- *	@security: Security level of packet
  *	@truesize: Buffer size
  *	@head: Head of buffer
  *	@data: Data head pointer
···
 				data_len,
 				mac_len,
 				csum;
-	unsigned char		local_df,
-				cloned:1,
-				nohdr:1,
-				pkt_type,
-				ip_summed;
 	__u32			priority;
-	unsigned short		protocol,
-				security;
+	__u8			local_df:1,
+				cloned:1,
+				ip_summed:2,
+				nohdr:1;
+				/* 3 bits spare */
+	__u8			pkt_type;
+	__u16			protocol;

 	void			(*destructor)(struct sk_buff *skb);
 #ifdef CONFIG_NETFILTER
-	unsigned long		nfmark;
+	unsigned long		nfmark;
 	__u32			nfcache;
 	__u32			nfctinfo;
 	struct nf_conntrack	*nfct;
···
 {
 	int hlen = skb_headlen(skb);

-	if (offset + len <= hlen)
+	if (hlen - offset >= len)
 		return skb->data + offset;

 	if (skb_copy_bits(skb, offset, buffer, len) < 0)
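
Note on the last skbuff.h hunk: the bounds test is rearranged from `offset + len <= hlen` to `hlen - offset >= len`. A plausible reading, not stated in the diff itself, is that the sum of two caller-controlled ints can overflow, while the subtraction cannot as long as hlen and offset are non-negative. The sketch below shows only the rearranged comparison. The helper fits_in_head() is hypothetical and the non-negativity precondition is an assumption of this sketch.

```c
#include <assert.h>
#include <limits.h>

/*
 * Return 1 if "len" bytes starting at "offset" lie entirely inside a
 * linear area of "hlen" bytes. No addition of offset and len is ever
 * performed, so the check itself cannot overflow.
 */
static int fits_in_head(int hlen, int offset, int len)
{
	assert(hlen >= 0 && offset >= 0);	/* assumed precondition */
	return hlen - offset >= len;
}
```

For a 100-byte head, a request at offset INT_MAX is simply rejected, because 100 - INT_MAX is a well-defined negative number that is smaller than any positive length.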

+1 -1

include/linux/tc_ematch/tc_em_meta.h

···
 	TCF_META_ID_REALDEV,
 	TCF_META_ID_PRIORITY,
 	TCF_META_ID_PROTOCOL,
-	TCF_META_ID_SECURITY,
+	TCF_META_ID_SECURITY, /* obsolete */
 	TCF_META_ID_PKTTYPE,
 	TCF_META_ID_PKTLEN,
 	TCF_META_ID_DATALEN,

+1 -1

include/linux/tcp.h

···
 	__u32	max_window;	/* Maximal window ever seen from peer	*/
 	__u32	pmtu_cookie;	/* Last pmtu seen by socket		*/
 	__u32	mss_cache;	/* Cached effective mss, not including SACKS */
-	__u16	mss_cache_std;	/* Like mss_cache, but without TSO */
+	__u16	xmit_size_goal;	/* Goal for segmenting output packets	*/
 	__u16	ext_header_len;	/* Network protocol overhead (IP/IPv6 options) */
 	__u8	ca_state;	/* State of fast-retransmit machine	*/
 	__u8	retransmits;	/* Number of unrecovered RTO timeouts.	*/

+3 -14

include/net/pkt_sched.h

···

 extern rwlock_t qdisc_tree_lock;

-#define QDISC_ALIGN		32
-#define QDISC_ALIGN_CONST	(QDISC_ALIGN - 1)
+#define QDISC_ALIGNTO		32
+#define QDISC_ALIGN(len)	(((len) + QDISC_ALIGNTO-1) & ~(QDISC_ALIGNTO-1))

 static inline void *qdisc_priv(struct Qdisc *q)
 {
-	return (char *)q + ((sizeof(struct Qdisc) + QDISC_ALIGN_CONST)
-			      & ~QDISC_ALIGN_CONST);
+	return (char *) q + QDISC_ALIGN(sizeof(struct Qdisc));
 }

 /*
···

 #endif /* !CONFIG_NET_SCH_CLK_GETTIMEOFDAY */

-extern struct Qdisc noop_qdisc;
-extern struct Qdisc_ops noop_qdisc_ops;
 extern struct Qdisc_ops pfifo_qdisc_ops;
 extern struct Qdisc_ops bfifo_qdisc_ops;

···
 extern int unregister_qdisc(struct Qdisc_ops *qops);
 extern struct Qdisc *qdisc_lookup(struct net_device *dev, u32 handle);
 extern struct Qdisc *qdisc_lookup_class(struct net_device *dev, u32 handle);
-extern void dev_init_scheduler(struct net_device *dev);
-extern void dev_shutdown(struct net_device *dev);
-extern void dev_activate(struct net_device *dev);
-extern void dev_deactivate(struct net_device *dev);
-extern void qdisc_reset(struct Qdisc *qdisc);
-extern void qdisc_destroy(struct Qdisc *qdisc);
-extern struct Qdisc * qdisc_create_dflt(struct net_device *dev,
-	struct Qdisc_ops *ops);
 extern struct qdisc_rate_table *qdisc_get_rtab(struct tc_ratespec *r,
 		struct rtattr *tab);
 extern void qdisc_put_rtab(struct qdisc_rate_table *tab);
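
Note on the QDISC_ALIGN() macro introduced above: it rounds a length up to the next multiple of QDISC_ALIGNTO, which works because 32 is a power of two. The two macros below are copied from the hunk; the wrapper priv_offset() is a hypothetical stand-in for the arithmetic in qdisc_priv(), taking a plain size instead of sizeof(struct Qdisc).

```c
#include <assert.h>
#include <stddef.h>

#define QDISC_ALIGNTO		32
#define QDISC_ALIGN(len)	(((len) + QDISC_ALIGNTO-1) & ~(QDISC_ALIGNTO-1))

/*
 * Offset at which private data would start after a header of
 * "header_size" bytes: the header size rounded up to 32 bytes.
 */
static size_t priv_offset(size_t header_size)
{
	size_t off = QDISC_ALIGN(header_size);

	assert(off % QDISC_ALIGNTO == 0 && off >= header_size);
	return off;
}
```

A header of 1 to 32 bytes yields an offset of 32, and a 33-byte header yields 64. Folding the rounding into one function-like macro lets callers other than qdisc_priv() reuse it without repeating the mask expression.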

+13

include/net/sch_generic.h

···
 #define tcf_tree_lock(tp)	qdisc_lock_tree((tp)->q->dev)
 #define tcf_tree_unlock(tp)	qdisc_unlock_tree((tp)->q->dev)

+extern struct Qdisc noop_qdisc;
+extern struct Qdisc_ops noop_qdisc_ops;
+
+extern void dev_init_scheduler(struct net_device *dev);
+extern void dev_shutdown(struct net_device *dev);
+extern void dev_activate(struct net_device *dev);
+extern void dev_deactivate(struct net_device *dev);
+extern void qdisc_reset(struct Qdisc *qdisc);
+extern void qdisc_destroy(struct Qdisc *qdisc);
+extern struct Qdisc *qdisc_alloc(struct net_device *dev, struct Qdisc_ops *ops);
+extern struct Qdisc *qdisc_create_dflt(struct net_device *dev,
+				       struct Qdisc_ops *ops);
+
 static inline void
 tcf_destroy(struct tcf_proto *tp)
 {

+7 -12

include/net/slhc_vj.h

···
 };
 #define NULLSLCOMPR	(struct slcompress *)0

-#define __ARGS(x) x
-
 /* In slhc.c: */
-struct slcompress *slhc_init __ARGS((int rslots, int tslots));
-void slhc_free __ARGS((struct slcompress *comp));
+struct slcompress *slhc_init(int rslots, int tslots);
+void slhc_free(struct slcompress *comp);

-int slhc_compress __ARGS((struct slcompress *comp, unsigned char *icp,
-			  int isize, unsigned char *ocp, unsigned char **cpp,
-			  int compress_cid));
-int slhc_uncompress __ARGS((struct slcompress *comp, unsigned char *icp,
-			    int isize));
-int slhc_remember __ARGS((struct slcompress *comp, unsigned char *icp,
-			  int isize));
-int slhc_toss __ARGS((struct slcompress *comp));
+int slhc_compress(struct slcompress *comp, unsigned char *icp, int isize,
+		  unsigned char *ocp, unsigned char **cpp, int compress_cid);
+int slhc_uncompress(struct slcompress *comp, unsigned char *icp, int isize);
+int slhc_remember(struct slcompress *comp, unsigned char *icp, int isize);
+int slhc_toss(struct slcompress *comp);

 #endif	/* _SLHC_H */

+5 -2

include/net/sock.h

···
 static inline struct sk_buff *sk_stream_alloc_pskb(struct sock *sk,
 						   int size, int mem, int gfp)
 {
-	struct sk_buff *skb = alloc_skb(size + sk->sk_prot->max_header, gfp);
+	struct sk_buff *skb;
+	int hdr_len;

+	hdr_len = SKB_DATA_ALIGN(sk->sk_prot->max_header);
+	skb = alloc_skb(size + hdr_len, gfp);
 	if (skb) {
 		skb->truesize += mem;
 		if (sk->sk_forward_alloc >= (int)skb->truesize ||
 		    sk_stream_mem_schedule(sk, skb->truesize, 0)) {
-			skb_reserve(skb, sk->sk_prot->max_header);
+			skb_reserve(skb, hdr_len);
 			return skb;
 		}
 		__kfree_skb(skb);

+17 -139

include/net/tcp.h

··· 721 721 return tp->ack.pending&TCP_ACK_SCHED; 722 722 } 723 723 724 - static __inline__ void tcp_dec_quickack_mode(struct tcp_sock *tp) 724 + static __inline__ void tcp_dec_quickack_mode(struct tcp_sock *tp, unsigned int pkts) 725 725 { 726 - if (tp->ack.quick && --tp->ack.quick == 0) { 727 - /* Leaving quickack mode we deflate ATO. */ 728 - tp->ack.ato = TCP_ATO_MIN; 726 + if (tp->ack.quick) { 727 + if (pkts >= tp->ack.quick) { 728 + tp->ack.quick = 0; 729 + 730 + /* Leaving quickack mode we deflate ATO. */ 731 + tp->ack.ato = TCP_ATO_MIN; 732 + } else 733 + tp->ack.quick -= pkts; 729 734 } 730 735 } 731 736 ··· 848 843 849 844 /* tcp_output.c */ 850 845 851 - extern int tcp_write_xmit(struct sock *, int nonagle); 846 + extern void __tcp_push_pending_frames(struct sock *sk, struct tcp_sock *tp, 847 + unsigned int cur_mss, int nonagle); 848 + extern int tcp_may_send_now(struct sock *sk, struct tcp_sock *tp); 852 849 extern int tcp_retransmit_skb(struct sock *, struct sk_buff *); 853 850 extern void tcp_xmit_retransmit_queue(struct sock *); 854 851 extern void tcp_simple_retransmit(struct sock *); ··· 862 855 extern void tcp_send_fin(struct sock *sk); 863 856 extern void tcp_send_active_reset(struct sock *sk, int priority); 864 857 extern int tcp_send_synack(struct sock *); 865 - extern void tcp_push_one(struct sock *, unsigned mss_now); 858 + extern void tcp_push_one(struct sock *, unsigned int mss_now); 866 859 extern void tcp_send_ack(struct sock *sk); 867 860 extern void tcp_send_delayed_ack(struct sock *sk); 861 + 862 + /* tcp_input.c */ 863 + extern void tcp_cwnd_application_limited(struct sock *sk); 868 864 869 865 /* tcp_timer.c */ 870 866 extern void tcp_init_xmit_timers(struct sock *); ··· 968 958 static inline void tcp_initialize_rcv_mss(struct sock *sk) 969 959 { 970 960 struct tcp_sock *tp = tcp_sk(sk); 971 - unsigned int hint = min(tp->advmss, tp->mss_cache_std); 961 + unsigned int hint = min_t(unsigned int, tp->advmss, tp->mss_cache); 972 962 973 
963 hint = min(hint, tp->rcv_wnd/2); 974 964 hint = min(hint, TCP_MIN_RCVMSS); ··· 1235 1225 tp->left_out = tp->sacked_out + tp->lost_out; 1236 1226 } 1237 1227 1238 - extern void tcp_cwnd_application_limited(struct sock *sk); 1239 - 1240 - /* Congestion window validation. (RFC2861) */ 1241 - 1242 - static inline void tcp_cwnd_validate(struct sock *sk, struct tcp_sock *tp) 1243 - { 1244 - __u32 packets_out = tp->packets_out; 1245 - 1246 - if (packets_out >= tp->snd_cwnd) { 1247 - /* Network is feed fully. */ 1248 - tp->snd_cwnd_used = 0; 1249 - tp->snd_cwnd_stamp = tcp_time_stamp; 1250 - } else { 1251 - /* Network starves. */ 1252 - if (tp->packets_out > tp->snd_cwnd_used) 1253 - tp->snd_cwnd_used = tp->packets_out; 1254 - 1255 - if ((s32)(tcp_time_stamp - tp->snd_cwnd_stamp) >= tp->rto) 1256 - tcp_cwnd_application_limited(sk); 1257 - } 1258 - } 1259 - 1260 1228 /* Set slow start threshould and cwnd not falling to slow start */ 1261 1229 static inline void __tcp_enter_cwr(struct tcp_sock *tp) 1262 1230 { ··· 1267 1279 return 3; 1268 1280 } 1269 1281 1270 - static __inline__ int tcp_minshall_check(const struct tcp_sock *tp) 1271 - { 1272 - return after(tp->snd_sml,tp->snd_una) && 1273 - !after(tp->snd_sml, tp->snd_nxt); 1274 - } 1275 - 1276 1282 static __inline__ void tcp_minshall_update(struct tcp_sock *tp, int mss, 1277 1283 const struct sk_buff *skb) 1278 1284 { 1279 1285 if (skb->len < mss) 1280 1286 tp->snd_sml = TCP_SKB_CB(skb)->end_seq; 1281 - } 1282 - 1283 - /* Return 0, if packet can be sent now without violation Nagle's rules: 1284 - 1. It is full sized. 1285 - 2. Or it contains FIN. 1286 - 3. Or TCP_NODELAY was set. 1287 - 4. Or TCP_CORK is not set, and all sent packets are ACKed. 1288 - With Minshall's modification: all sent small packets are ACKed. 
1289 - */ 1290 - 1291 - static __inline__ int 1292 - tcp_nagle_check(const struct tcp_sock *tp, const struct sk_buff *skb, 1293 - unsigned mss_now, int nonagle) 1294 - { 1295 - return (skb->len < mss_now && 1296 - !(TCP_SKB_CB(skb)->flags & TCPCB_FLAG_FIN) && 1297 - ((nonagle&TCP_NAGLE_CORK) || 1298 - (!nonagle && 1299 - tp->packets_out && 1300 - tcp_minshall_check(tp)))); 1301 - } 1302 - 1303 - extern void tcp_set_skb_tso_segs(struct sock *, struct sk_buff *); 1304 - 1305 - /* This checks if the data bearing packet SKB (usually sk->sk_send_head) 1306 - * should be put on the wire right now. 1307 - */ 1308 - static __inline__ int tcp_snd_test(struct sock *sk, 1309 - struct sk_buff *skb, 1310 - unsigned cur_mss, int nonagle) 1311 - { 1312 - struct tcp_sock *tp = tcp_sk(sk); 1313 - int pkts = tcp_skb_pcount(skb); 1314 - 1315 - if (!pkts) { 1316 - tcp_set_skb_tso_segs(sk, skb); 1317 - pkts = tcp_skb_pcount(skb); 1318 - } 1319 - 1320 - /* RFC 1122 - section 4.2.3.4 1321 - * 1322 - * We must queue if 1323 - * 1324 - * a) The right edge of this frame exceeds the window 1325 - * b) There are packets in flight and we have a small segment 1326 - * [SWS avoidance and Nagle algorithm] 1327 - * (part of SWS is done on packetization) 1328 - * Minshall version sounds: there are no _small_ 1329 - * segments in flight. (tcp_nagle_check) 1330 - * c) We have too many packets 'in flight' 1331 - * 1332 - * Don't use the nagle rule for urgent data (or 1333 - * for the final FIN -DaveM). 1334 - * 1335 - * Also, Nagle rule does not apply to frames, which 1336 - * sit in the middle of queue (they have no chances 1337 - * to get new data) and if room at tail of skb is 1338 - * not enough to save something seriously (<32 for now). 1339 - */ 1340 - 1341 - /* Don't be strict about the congestion window for the 1342 - * final FIN frame. 
-DaveM 1343 - */ 1344 - return (((nonagle&TCP_NAGLE_PUSH) || tp->urg_mode 1345 - || !tcp_nagle_check(tp, skb, cur_mss, nonagle)) && 1346 - (((tcp_packets_in_flight(tp) + (pkts-1)) < tp->snd_cwnd) || 1347 - (TCP_SKB_CB(skb)->flags & TCPCB_FLAG_FIN)) && 1348 - !after(TCP_SKB_CB(skb)->end_seq, tp->snd_una + tp->snd_wnd)); 1349 1287 } 1350 1288 1351 1289 static __inline__ void tcp_check_probe_timer(struct sock *sk, struct tcp_sock *tp) ··· 1280 1366 tcp_reset_xmit_timer(sk, TCP_TIME_PROBE0, tp->rto); 1281 1367 } 1282 1368 1283 - static __inline__ int tcp_skb_is_last(const struct sock *sk, 1284 - const struct sk_buff *skb) 1285 - { 1286 - return skb->next == (struct sk_buff *)&sk->sk_write_queue; 1287 - } 1288 - 1289 - /* Push out any pending frames which were held back due to 1290 - * TCP_CORK or attempt at coalescing tiny packets. 1291 - * The socket must be locked by the caller. 1292 - */ 1293 - static __inline__ void __tcp_push_pending_frames(struct sock *sk, 1294 - struct tcp_sock *tp, 1295 - unsigned cur_mss, 1296 - int nonagle) 1297 - { 1298 - struct sk_buff *skb = sk->sk_send_head; 1299 - 1300 - if (skb) { 1301 - if (!tcp_skb_is_last(sk, skb)) 1302 - nonagle = TCP_NAGLE_PUSH; 1303 - if (!tcp_snd_test(sk, skb, cur_mss, nonagle) || 1304 - tcp_write_xmit(sk, nonagle)) 1305 - tcp_check_probe_timer(sk, tp); 1306 - } 1307 - tcp_cwnd_validate(sk, tp); 1308 - } 1309 - 1310 1369 static __inline__ void tcp_push_pending_frames(struct sock *sk, 1311 1370 struct tcp_sock *tp) 1312 1371 { 1313 1372 __tcp_push_pending_frames(sk, tp, tcp_current_mss(sk, 1), tp->nonagle); 1314 - } 1315 - 1316 - static __inline__ int tcp_may_send_now(struct sock *sk, struct tcp_sock *tp) 1317 - { 1318 - struct sk_buff *skb = sk->sk_send_head; 1319 - 1320 - return (skb && 1321 - tcp_snd_test(sk, skb, tcp_current_mss(sk, 1), 1322 - tcp_skb_is_last(sk, skb) ? TCP_NAGLE_PUSH : tp->nonagle)); 1323 1373 } 1324 1374 1325 1375 static __inline__ void tcp_init_wl(struct tcp_sock *tp, u32 ack, u32 seq)
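
Note on the tcp_dec_quickack_mode() change in this file: the function now subtracts a packet count instead of always decrementing by one, and it leaves quickack mode when the count reaches or exceeds the remaining budget. The sketch below mirrors that control flow on a stripped-down state struct. The struct ack_state, the function dec_quickack() and the ATO_MIN value are hypothetical stand-ins for tp->ack, the kernel function and TCP_ATO_MIN.

```c
#include <assert.h>

#define ATO_MIN 40	/* placeholder value, not the kernel's TCP_ATO_MIN */

/* Hypothetical stand-in for the relevant fields of tp->ack. */
struct ack_state {
	unsigned int quick;	/* remaining quick ACKs */
	unsigned int ato;	/* delayed-ACK timeout estimate */
};

/* Account for "pkts" packets; deflate ATO when the budget runs out. */
static void dec_quickack(struct ack_state *ack, unsigned int pkts)
{
	assert(ack != 0);
	if (ack->quick) {
		if (pkts >= ack->quick) {
			ack->quick = 0;
			/* Leaving quickack mode we deflate ATO. */
			ack->ato = ATO_MIN;
		} else
			ack->quick -= pkts;
	}
}
```

Comparing before subtracting keeps the unsigned counter from wrapping below zero when pkts is larger than what remains.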

+3 -2

net/core/dev.c

···
 {
 	unsigned short old_flags = dev->flags;

-	dev->flags |= IFF_PROMISC;
 	if ((dev->promiscuity += inc) == 0)
 		dev->flags &= ~IFF_PROMISC;
-	if (dev->flags ^ old_flags) {
+	else
+		dev->flags |= IFF_PROMISC;
+	if (dev->flags != old_flags) {
 		dev_mc_upload(dev);
 		printk(KERN_INFO "device %s %s promiscuous mode\n",
 		       dev->name, (dev->flags & IFF_PROMISC) ? "entered" :

+32 -72

net/core/filter.c

··· 36 36 #include <linux/filter.h> 37 37 38 38 /* No hurry in this branch */ 39 - static u8 *load_pointer(struct sk_buff *skb, int k) 39 + static void *__load_pointer(struct sk_buff *skb, int k) 40 40 { 41 41 u8 *ptr = NULL; 42 42 ··· 48 48 if (ptr >= skb->head && ptr < skb->tail) 49 49 return ptr; 50 50 return NULL; 51 + } 52 + 53 + static inline void *load_pointer(struct sk_buff *skb, int k, 54 + unsigned int size, void *buffer) 55 + { 56 + if (k >= 0) 57 + return skb_header_pointer(skb, k, size, buffer); 58 + else { 59 + if (k >= SKF_AD_OFF) 60 + return NULL; 61 + return __load_pointer(skb, k); 62 + } 51 63 } 52 64 53 65 /** ··· 76 64 77 65 int sk_run_filter(struct sk_buff *skb, struct sock_filter *filter, int flen) 78 66 { 79 - unsigned char *data = skb->data; 80 - /* len is UNSIGNED. Byte wide insns relies only on implicit 81 - type casts to prevent reading arbitrary memory locations. 82 - */ 83 - unsigned int len = skb->len-skb->data_len; 84 67 struct sock_filter *fentry; /* We walk down these */ 68 + void *ptr; 85 69 u32 A = 0; /* Accumulator */ 86 70 u32 X = 0; /* Index Register */ 87 71 u32 mem[BPF_MEMWORDS]; /* Scratch Memory Store */ 72 + u32 tmp; 88 73 int k; 89 74 int pc; 90 75 ··· 177 168 case BPF_LD|BPF_W|BPF_ABS: 178 169 k = fentry->k; 179 170 load_w: 180 - if (k >= 0 && (unsigned int)(k+sizeof(u32)) <= len) { 181 - A = ntohl(*(u32*)&data[k]); 171 + ptr = load_pointer(skb, k, 4, &tmp); 172 + if (ptr != NULL) { 173 + A = ntohl(*(u32 *)ptr); 182 174 continue; 183 - } 184 - if (k < 0) { 185 - u8 *ptr; 186 - 187 - if (k >= SKF_AD_OFF) 188 - break; 189 - ptr = load_pointer(skb, k); 190 - if (ptr) { 191 - A = ntohl(*(u32*)ptr); 192 - continue; 193 - } 194 - } else { 195 - u32 _tmp, *p; 196 - p = skb_header_pointer(skb, k, 4, &_tmp); 197 - if (p != NULL) { 198 - A = ntohl(*p); 199 - continue; 200 - } 201 175 } 202 176 return 0; 203 177 case BPF_LD|BPF_H|BPF_ABS: 204 178 k = fentry->k; 205 179 load_h: 206 - if (k >= 0 && (unsigned int)(k + sizeof(u16)) <= 
len) { 207 - A = ntohs(*(u16*)&data[k]); 180 + ptr = load_pointer(skb, k, 2, &tmp); 181 + if (ptr != NULL) { 182 + A = ntohs(*(u16 *)ptr); 208 183 continue; 209 - } 210 - if (k < 0) { 211 - u8 *ptr; 212 - 213 - if (k >= SKF_AD_OFF) 214 - break; 215 - ptr = load_pointer(skb, k); 216 - if (ptr) { 217 - A = ntohs(*(u16*)ptr); 218 - continue; 219 - } 220 - } else { 221 - u16 _tmp, *p; 222 - p = skb_header_pointer(skb, k, 2, &_tmp); 223 - if (p != NULL) { 224 - A = ntohs(*p); 225 - continue; 226 - } 227 184 } 228 185 return 0; 229 186 case BPF_LD|BPF_B|BPF_ABS: 230 187 k = fentry->k; 231 188 load_b: 232 - if (k >= 0 && (unsigned int)k < len) { 233 - A = data[k]; 189 + ptr = load_pointer(skb, k, 1, &tmp); 190 + if (ptr != NULL) { 191 + A = *(u8 *)ptr; 234 192 continue; 235 - } 236 - if (k < 0) { 237 - u8 *ptr; 238 - 239 - if (k >= SKF_AD_OFF) 240 - break; 241 - ptr = load_pointer(skb, k); 242 - if (ptr) { 243 - A = *ptr; 244 - continue; 245 - } 246 - } else { 247 - u8 _tmp, *p; 248 - p = skb_header_pointer(skb, k, 1, &_tmp); 249 - if (p != NULL) { 250 - A = *p; 251 - continue; 252 - } 253 193 } 254 194 return 0; 255 195 case BPF_LD|BPF_W|BPF_LEN: 256 - A = len; 196 + A = skb->len; 257 197 continue; 258 198 case BPF_LDX|BPF_W|BPF_LEN: 259 - X = len; 199 + X = skb->len; 260 200 continue; 261 201 case BPF_LD|BPF_W|BPF_IND: 262 202 k = X + fentry->k; ··· 217 259 k = X + fentry->k; 218 260 goto load_b; 219 261 case BPF_LDX|BPF_B|BPF_MSH: 220 - if (fentry->k >= len) 221 - return 0; 222 - X = (data[fentry->k] & 0xf) << 2; 223 - continue; 262 + ptr = load_pointer(skb, fentry->k, 1, &tmp); 263 + if (ptr != NULL) { 264 + X = (*(u8 *)ptr & 0xf) << 2; 265 + continue; 266 + } 267 + return 0; 224 268 case BPF_LD|BPF_IMM: 225 269 A = fentry->k; 226 270 continue;
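
The filter.c change above routes every absolute and indirect load through one helper that either points into the linear data or copies the bytes into a caller-supplied scratch buffer. A minimal userspace sketch of that pattern follows. The `toy_pkt` type and all `toy_*` functions are hypothetical stand-ins for illustration; they are not the kernel's `skb_header_pointer()` and they skip the negative-offset (`SKF_AD_OFF`) handling.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for an skb: a linear head plus one fragment. */
struct toy_pkt {
    const unsigned char *head;
    size_t head_len;
    const unsigned char *frag;
    size_t frag_len;
};

/*
 * Return a pointer to `size` bytes at `offset`. If they sit entirely in
 * the linear head, point straight at them; otherwise copy them into
 * `buffer`. Returns NULL when the packet is too short.
 */
static const void *toy_header_pointer(const struct toy_pkt *p, size_t offset,
                                      size_t size, void *buffer)
{
    unsigned char *out = buffer;
    size_t i;

    if (offset + size <= p->head_len)
        return p->head + offset;
    if (offset + size > p->head_len + p->frag_len)
        return NULL;
    for (i = 0; i < size; i++) {
        size_t pos = offset + i;

        out[i] = pos < p->head_len ? p->head[pos]
                                   : p->frag[pos - p->head_len];
    }
    return buffer;
}

/* Big-endian 16-bit load, shaped like the BPF_LD|BPF_H|BPF_ABS case. */
static int toy_load_half(const struct toy_pkt *p, size_t k, unsigned int *A)
{
    unsigned char tmp[2];
    const unsigned char *ptr = toy_header_pointer(p, k, 2, tmp);

    if (ptr == NULL)
        return 0;   /* the filter would return 0 at this point */
    *A = ((unsigned int)ptr[0] << 8) | ptr[1];
    return 1;
}
```

With one helper per load size, each opcode case reduces to "get pointer, check for NULL, convert", which is the shape the patched `sk_run_filter()` takes.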

-2

net/core/skbuff.c

··· 357 357 C(ip_summed); 358 358 C(priority); 359 359 C(protocol); 360 - C(security); 361 360 n->destructor = NULL; 362 361 #ifdef CONFIG_NETFILTER 363 362 C(nfmark); ··· 421 422 new->pkt_type = old->pkt_type; 422 423 new->stamp = old->stamp; 423 424 new->destructor = NULL; 424 - new->security = old->security; 425 425 #ifdef CONFIG_NETFILTER 426 426 new->nfmark = old->nfmark; 427 427 new->nfcache = old->nfcache;

+2 -1

net/decnet/dn_fib.c

··· 551 551 if (t < s_t) 552 552 continue; 553 553 if (t > s_t) 554 - memset(&cb->args[1], 0, sizeof(cb->args)-sizeof(int)); 554 + memset(&cb->args[1], 0, 555 + sizeof(cb->args) - sizeof(cb->args[0])); 555 556 tb = dn_fib_get_table(t, 0); 556 557 if (tb == NULL) 557 558 continue;
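
The dn_fib.c hunk above replaces `sizeof(int)` with `sizeof(cb->args[0])` when clearing everything after the first element. The sketch below shows why the element size matters, assuming the array elements are `long` (wider than `int` on LP64 targets). `toy_cb`, `TOY_NARGS` and the helper names are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_NARGS 4   /* assumed array length, for illustration only */

struct toy_cb {
    long args[TOY_NARGS];
    unsigned char guard[sizeof(long)];  /* stands in for whatever follows args */
};

/* Byte count produced by the old expression: array size minus sizeof(int). */
static size_t old_clear_len(void)
{
    return sizeof(((struct toy_cb *)0)->args) - sizeof(int);
}

/* Byte count produced by the new expression: array size minus one element. */
static size_t new_clear_len(void)
{
    return sizeof(((struct toy_cb *)0)->args) -
           sizeof(((struct toy_cb *)0)->args[0]);
}

/* Zero args[1..] the way the patched line does. */
static void toy_reset_tail(struct toy_cb *cb)
{
    memset(&cb->args[1], 0, sizeof(cb->args) - sizeof(cb->args[0]));
}
```

When `long` and `int` have the same size the two lengths agree; when `long` is wider, the old length runs `sizeof(long) - sizeof(int)` bytes past the end of the array.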

+11

net/ipv4/af_inet.c

··· 1009 1009 static int ipv4_proc_init(void); 1010 1010 extern void ipfrag_init(void); 1011 1011 1012 + /* 1013 + * IP protocol layer initialiser 1014 + */ 1015 + 1016 + static struct packet_type ip_packet_type = { 1017 + .type = __constant_htons(ETH_P_IP), 1018 + .func = ip_rcv, 1019 + }; 1020 + 1012 1021 static int __init inet_init(void) 1013 1022 { 1014 1023 struct sk_buff *dummy_skb; ··· 1110 1101 ipv4_proc_init(); 1111 1102 1112 1103 ipfrag_init(); 1104 + 1105 + dev_add_pack(&ip_packet_type); 1113 1106 1114 1107 rc = 0; 1115 1108 out:

+168 -34

net/ipv4/fib_trie.c

··· 43 43 * 2 of the License, or (at your option) any later version. 44 44 */ 45 45 46 - #define VERSION "0.324" 46 + #define VERSION "0.325" 47 47 48 48 #include <linux/config.h> 49 49 #include <asm/uaccess.h> ··· 136 136 unsigned int semantic_match_passed; 137 137 unsigned int semantic_match_miss; 138 138 unsigned int null_node_hit; 139 + unsigned int resize_node_skipped; 139 140 }; 140 141 #endif 141 142 ··· 165 164 static void tnode_put_child_reorg(struct tnode *tn, int i, struct node *n, int wasfull); 166 165 static int tnode_child_length(struct tnode *tn); 167 166 static struct node *resize(struct trie *t, struct tnode *tn); 168 - static struct tnode *inflate(struct trie *t, struct tnode *tn); 169 - static struct tnode *halve(struct trie *t, struct tnode *tn); 167 + static struct tnode *inflate(struct trie *t, struct tnode *tn, int *err); 168 + static struct tnode *halve(struct trie *t, struct tnode *tn, int *err); 170 169 static void tnode_free(struct tnode *tn); 171 170 static void trie_dump_seq(struct seq_file *seq, struct trie *t); 172 171 extern struct fib_alias *fib_find_alias(struct list_head *fah, u8 tos, u32 prio); ··· 359 358 kfree(li); 360 359 } 361 360 361 + static struct tnode *tnode_alloc(unsigned int size) 362 + { 363 + if (size <= PAGE_SIZE) { 364 + return kmalloc(size, GFP_KERNEL); 365 + } else { 366 + return (struct tnode *) 367 + __get_free_pages(GFP_KERNEL, get_order(size)); 368 + } 369 + } 370 + 371 + static void __tnode_free(struct tnode *tn) 372 + { 373 + unsigned int size = sizeof(struct tnode) + 374 + (1<<tn->bits) * sizeof(struct node *); 375 + 376 + if (size <= PAGE_SIZE) 377 + kfree(tn); 378 + else 379 + free_pages((unsigned long)tn, get_order(size)); 380 + } 381 + 362 382 static struct tnode* tnode_new(t_key key, int pos, int bits) 363 383 { 364 384 int nchildren = 1<<bits; 365 385 int sz = sizeof(struct tnode) + nchildren * sizeof(struct node *); 366 - struct tnode *tn = kmalloc(sz, GFP_KERNEL); 386 + struct tnode *tn = 
tnode_alloc(sz); 367 387 368 388 if(tn) { 369 389 memset(tn, 0, sz); ··· 412 390 printk("FL %p \n", tn); 413 391 } 414 392 else if(IS_TNODE(tn)) { 415 - kfree(tn); 393 + __tnode_free(tn); 416 394 if(trie_debug > 0 ) 417 395 printk("FT %p \n", tn); 418 396 } ··· 482 460 static struct node *resize(struct trie *t, struct tnode *tn) 483 461 { 484 462 int i; 463 + int err = 0; 485 464 486 465 if (!tn) 487 466 return NULL; ··· 579 556 */ 580 557 581 558 check_tnode(tn); 582 - 559 + 560 + err = 0; 583 561 while ((tn->full_children > 0 && 584 562 50 * (tn->full_children + tnode_child_length(tn) - tn->empty_children) >= 585 563 inflate_threshold * tnode_child_length(tn))) { 586 564 587 - tn = inflate(t, tn); 565 + tn = inflate(t, tn, &err); 566 + 567 + if(err) { 568 + #ifdef CONFIG_IP_FIB_TRIE_STATS 569 + t->stats.resize_node_skipped++; 570 + #endif 571 + break; 572 + } 588 573 } 589 574 590 575 check_tnode(tn); ··· 601 570 * Halve as long as the number of empty children in this 602 571 * node is above threshold. 
603 572 */ 573 + 574 + err = 0; 604 575 while (tn->bits > 1 && 605 576 100 * (tnode_child_length(tn) - tn->empty_children) < 606 - halve_threshold * tnode_child_length(tn)) 577 + halve_threshold * tnode_child_length(tn)) { 607 578 608 - tn = halve(t, tn); 579 + tn = halve(t, tn, &err); 580 + 581 + if(err) { 582 + #ifdef CONFIG_IP_FIB_TRIE_STATS 583 + t->stats.resize_node_skipped++; 584 + #endif 585 + break; 586 + } 587 + } 588 + 609 589 610 590 /* Only one child remains */ 611 591 ··· 641 599 return (struct node *) tn; 642 600 } 643 601 644 - static struct tnode *inflate(struct trie *t, struct tnode *tn) 602 + static struct tnode *inflate(struct trie *t, struct tnode *tn, int *err) 645 603 { 646 604 struct tnode *inode; 647 605 struct tnode *oldtnode = tn; ··· 653 611 654 612 tn = tnode_new(oldtnode->key, oldtnode->pos, oldtnode->bits + 1); 655 613 656 - if (!tn) 657 - trie_bug("tnode_new failed"); 614 + if (!tn) { 615 + *err = -ENOMEM; 616 + return oldtnode; 617 + } 618 + 619 + /* 620 + * Preallocate and store tnodes before the actual work so we 621 + * don't get into an inconsistent state if memory allocation 622 + * fails. In case of failure we return the oldnode and inflate 623 + * of tnode is ignored. 
624 + */ 625 + 626 + for(i = 0; i < olen; i++) { 627 + struct tnode *inode = (struct tnode *) tnode_get_child(oldtnode, i); 628 + 629 + if (inode && 630 + IS_TNODE(inode) && 631 + inode->pos == oldtnode->pos + oldtnode->bits && 632 + inode->bits > 1) { 633 + struct tnode *left, *right; 634 + 635 + t_key m = TKEY_GET_MASK(inode->pos, 1); 636 + 637 + left = tnode_new(inode->key&(~m), inode->pos + 1, 638 + inode->bits - 1); 639 + 640 + if(!left) { 641 + *err = -ENOMEM; 642 + break; 643 + } 644 + 645 + right = tnode_new(inode->key|m, inode->pos + 1, 646 + inode->bits - 1); 647 + 648 + if(!right) { 649 + *err = -ENOMEM; 650 + break; 651 + } 652 + 653 + put_child(t, tn, 2*i, (struct node *) left); 654 + put_child(t, tn, 2*i+1, (struct node *) right); 655 + } 656 + } 657 + 658 + if(*err) { 659 + int size = tnode_child_length(tn); 660 + int j; 661 + 662 + for(j = 0; j < size; j++) 663 + if( tn->child[j]) 664 + tnode_free((struct tnode *)tn->child[j]); 665 + 666 + tnode_free(tn); 667 + 668 + *err = -ENOMEM; 669 + return oldtnode; 670 + } 658 671 659 672 for(i = 0; i < olen; i++) { 660 673 struct node *node = tnode_get_child(oldtnode, i); ··· 722 625 723 626 if(IS_LEAF(node) || ((struct tnode *) node)->pos > 724 627 tn->pos + tn->bits - 1) { 725 - if(tkey_extract_bits(node->key, tn->pos + tn->bits - 1, 628 + if(tkey_extract_bits(node->key, oldtnode->pos + oldtnode->bits, 726 629 1) == 0) 727 630 put_child(t, tn, 2*i, node); 728 631 else ··· 762 665 * the position (inode->pos) 763 666 */ 764 667 765 - t_key m = TKEY_GET_MASK(inode->pos, 1); 766 - 767 668 /* Use the old key, but set the new significant 768 669 * bit to zero. 769 670 */ 770 - left = tnode_new(inode->key&(~m), inode->pos + 1, 771 - inode->bits - 1); 772 671 773 - if(!left) 774 - trie_bug("tnode_new failed"); 775 - 776 - 777 - /* Use the old key, but set the new significant 778 - * bit to one. 
779 - */ 780 - right = tnode_new(inode->key|m, inode->pos + 1, 781 - inode->bits - 1); 672 + left = (struct tnode *) tnode_get_child(tn, 2*i); 673 + put_child(t, tn, 2*i, NULL); 782 674 783 - if(!right) 784 - trie_bug("tnode_new failed"); 785 - 675 + if(!left) 676 + BUG(); 677 + 678 + right = (struct tnode *) tnode_get_child(tn, 2*i+1); 679 + put_child(t, tn, 2*i+1, NULL); 680 + 681 + if(!right) 682 + BUG(); 683 + 786 684 size = tnode_child_length(left); 787 685 for(j = 0; j < size; j++) { 788 686 put_child(t, left, j, inode->child[j]); ··· 793 701 return tn; 794 702 } 795 703 796 - static struct tnode *halve(struct trie *t, struct tnode *tn) 704 + static struct tnode *halve(struct trie *t, struct tnode *tn, int *err) 797 705 { 798 706 struct tnode *oldtnode = tn; 799 707 struct node *left, *right; ··· 804 712 805 713 tn=tnode_new(oldtnode->key, oldtnode->pos, oldtnode->bits - 1); 806 714 807 - if(!tn) 808 - trie_bug("tnode_new failed"); 715 + if (!tn) { 716 + *err = -ENOMEM; 717 + return oldtnode; 718 + } 719 + 720 + /* 721 + * Preallocate and store tnodes before the actual work so we 722 + * don't get into an inconsistent state if memory allocation 723 + * fails. In case of failure we return the oldnode and halve 724 + * of tnode is ignored. 
725 + */ 726 + 727 + for(i = 0; i < olen; i += 2) { 728 + left = tnode_get_child(oldtnode, i); 729 + right = tnode_get_child(oldtnode, i+1); 730 + 731 + /* Two nonempty children */ 732 + if( left && right) { 733 + struct tnode *newBinNode = 734 + tnode_new(left->key, tn->pos + tn->bits, 1); 735 + 736 + if(!newBinNode) { 737 + *err = -ENOMEM; 738 + break; 739 + } 740 + put_child(t, tn, i/2, (struct node *)newBinNode); 741 + } 742 + } 743 + 744 + if(*err) { 745 + int size = tnode_child_length(tn); 746 + int j; 747 + 748 + for(j = 0; j < size; j++) 749 + if( tn->child[j]) 750 + tnode_free((struct tnode *)tn->child[j]); 751 + 752 + tnode_free(tn); 753 + 754 + *err = -ENOMEM; 755 + return oldtnode; 756 + } 809 757 810 758 for(i = 0; i < olen; i += 2) { 811 759 left = tnode_get_child(oldtnode, i); ··· 862 730 /* Two nonempty children */ 863 731 else { 864 732 struct tnode *newBinNode = 865 - tnode_new(left->key, tn->pos + tn->bits, 1); 733 + (struct tnode *) tnode_get_child(tn, i/2); 734 + put_child(t, tn, i/2, NULL); 866 735 867 736 if(!newBinNode) 868 - trie_bug("tnode_new failed"); 737 + BUG(); 869 738 870 739 put_child(t, newBinNode, 0, left); 871 740 put_child(t, newBinNode, 1, right); ··· 2434 2301 seq_printf(seq,"semantic match passed = %d\n", t->stats.semantic_match_passed); 2435 2302 seq_printf(seq,"semantic match miss = %d\n", t->stats.semantic_match_miss); 2436 2303 seq_printf(seq,"null node hit= %d\n", t->stats.null_node_hit); 2304 + seq_printf(seq,"skipped node resize = %d\n", t->stats.resize_node_skipped); 2437 2305 #ifdef CLEAR_STATS 2438 2306 memset(&(t->stats), 0, sizeof(t->stats)); 2439 2307 #endif
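
The fib_trie.c hunks above change `inflate()` and `halve()` to preallocate every node they will need, and to fall back to the old tnode if any allocation fails. A reduced userspace sketch of that "preallocate, then commit" idea is shown here. Everything prefixed `toy_` is hypothetical, including the failure-injecting allocator; the sketch does not migrate children the way the real functions do.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical failure injection: the Nth allocation from now fails (0 = never). */
static int toy_fail_at;
static int toy_live;   /* outstanding allocations, used to check for leaks */

static void *toy_alloc(size_t size)
{
    void *p;

    if (toy_fail_at && --toy_fail_at == 0)
        return NULL;
    p = calloc(1, size);   /* zeroed, so child slots start out NULL */
    if (p)
        toy_live++;
    return p;
}

static void toy_free(void *p)
{
    if (p) {
        toy_live--;
        free(p);
    }
}

struct toy_node {
    int nchildren;
    struct toy_node *child[];   /* flexible array member (C99) */
};

static struct toy_node *toy_node_new(int nchildren)
{
    struct toy_node *n = toy_alloc(sizeof(*n) +
                                   nchildren * sizeof(n->child[0]));

    if (n)
        n->nchildren = nchildren;
    return n;
}

/*
 * Build a replacement with twice as many slots, preallocating one helper
 * node per slot. On any failure, free everything built so far and return
 * the untouched old node with *err set. The caller sets *err to 0 first.
 */
static struct toy_node *toy_grow(struct toy_node *old, int *err)
{
    struct toy_node *tn = toy_node_new(old->nchildren * 2);
    int i;

    if (!tn) {
        *err = -ENOMEM;
        return old;
    }
    for (i = 0; i < tn->nchildren; i++) {
        tn->child[i] = toy_node_new(0);
        if (!tn->child[i]) {
            *err = -ENOMEM;
            break;
        }
    }
    if (*err) {
        for (i = 0; i < tn->nchildren; i++)
            toy_free(tn->child[i]);   /* toy_free() ignores NULL */
        toy_free(tn);
        return old;
    }
    /* Commit: nothing below can fail. Old children are not migrated here. */
    toy_free(old);
    return tn;
}
```

Because the caller always gets back a valid node, a resize loop can treat a failed grow as "skip this step" and count it, which is what the new `resize_node_skipped` statistic records.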

-16

net/ipv4/ip_output.c

··· 389 389 to->pkt_type = from->pkt_type; 390 390 to->priority = from->priority; 391 391 to->protocol = from->protocol; 392 - to->security = from->security; 393 392 dst_release(to->dst); 394 393 to->dst = dst_clone(from->dst); 395 394 to->dev = from->dev; ··· 1328 1329 ip_rt_put(rt); 1329 1330 } 1330 1331 1331 - /* 1332 - * IP protocol layer initialiser 1333 - */ 1334 - 1335 - static struct packet_type ip_packet_type = { 1336 - .type = __constant_htons(ETH_P_IP), 1337 - .func = ip_rcv, 1338 - }; 1339 - 1340 - /* 1341 - * IP registers the packet type and then calls the subprotocol initialisers 1342 - */ 1343 - 1344 1332 void __init ip_init(void) 1345 1333 { 1346 - dev_add_pack(&ip_packet_type); 1347 - 1348 1334 ip_rt_init(); 1349 1335 inet_initpeers(); 1350 1336

+74 -50

net/ipv4/route.c

··· 54 54 * Marc Boucher : routing by fwmark 55 55 * Robert Olsson : Added rt_cache statistics 56 56 * Arnaldo C. Melo : Convert proc stuff to seq_file 57 + * Eric Dumazet : hashed spinlocks and rt_check_expire() fixes. 57 58 * 58 59 * This program is free software; you can redistribute it and/or 59 60 * modify it under the terms of the GNU General Public License ··· 71 70 #include <linux/kernel.h> 72 71 #include <linux/sched.h> 73 72 #include <linux/mm.h> 73 + #include <linux/bootmem.h> 74 74 #include <linux/string.h> 75 75 #include <linux/socket.h> 76 76 #include <linux/sockios.h> ··· 203 201 204 202 struct rt_hash_bucket { 205 203 struct rtable *chain; 206 - spinlock_t lock; 207 - } __attribute__((__aligned__(8))); 204 + }; 205 + #if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK) 206 + /* 207 + * Instead of using one spinlock for each rt_hash_bucket, we use a table of spinlocks 208 + * The size of this table is a power of two and depends on the number of CPUS. 209 + */ 210 + #if NR_CPUS >= 32 211 + #define RT_HASH_LOCK_SZ 4096 212 + #elif NR_CPUS >= 16 213 + #define RT_HASH_LOCK_SZ 2048 214 + #elif NR_CPUS >= 8 215 + #define RT_HASH_LOCK_SZ 1024 216 + #elif NR_CPUS >= 4 217 + #define RT_HASH_LOCK_SZ 512 218 + #else 219 + #define RT_HASH_LOCK_SZ 256 220 + #endif 221 + 222 + static spinlock_t *rt_hash_locks; 223 + # define rt_hash_lock_addr(slot) &rt_hash_locks[(slot) & (RT_HASH_LOCK_SZ - 1)] 224 + # define rt_hash_lock_init() { \ 225 + int i; \ 226 + rt_hash_locks = kmalloc(sizeof(spinlock_t) * RT_HASH_LOCK_SZ, GFP_KERNEL); \ 227 + if (!rt_hash_locks) panic("IP: failed to allocate rt_hash_locks\n"); \ 228 + for (i = 0; i < RT_HASH_LOCK_SZ; i++) \ 229 + spin_lock_init(&rt_hash_locks[i]); \ 230 + } 231 + #else 232 + # define rt_hash_lock_addr(slot) NULL 233 + # define rt_hash_lock_init() 234 + #endif 208 235 209 236 static struct rt_hash_bucket *rt_hash_table; 210 237 static unsigned rt_hash_mask; ··· 606 575 /* This runs via a timer and thus is always in BH 
context. */ 607 576 static void rt_check_expire(unsigned long dummy) 608 577 { 609 - static int rover; 610 - int i = rover, t; 578 + static unsigned int rover; 579 + unsigned int i = rover, goal; 611 580 struct rtable *rth, **rthp; 612 581 unsigned long now = jiffies; 582 + u64 mult; 613 583 614 - for (t = ip_rt_gc_interval << rt_hash_log; t >= 0; 615 - t -= ip_rt_gc_timeout) { 584 + mult = ((u64)ip_rt_gc_interval) << rt_hash_log; 585 + if (ip_rt_gc_timeout > 1) 586 + do_div(mult, ip_rt_gc_timeout); 587 + goal = (unsigned int)mult; 588 + if (goal > rt_hash_mask) goal = rt_hash_mask + 1; 589 + for (; goal > 0; goal--) { 616 590 unsigned long tmo = ip_rt_gc_timeout; 617 591 618 592 i = (i + 1) & rt_hash_mask; 619 593 rthp = &rt_hash_table[i].chain; 620 594 621 - spin_lock(&rt_hash_table[i].lock); 595 + if (*rthp == 0) 596 + continue; 597 + spin_lock(rt_hash_lock_addr(i)); 622 598 while ((rth = *rthp) != NULL) { 623 599 if (rth->u.dst.expires) { 624 600 /* Entry is expired even if it is in use */ ··· 658 620 rt_free(rth); 659 621 #endif /* CONFIG_IP_ROUTE_MULTIPATH_CACHED */ 660 622 } 661 - spin_unlock(&rt_hash_table[i].lock); 623 + spin_unlock(rt_hash_lock_addr(i)); 662 624 663 625 /* Fallback loop breaker. 
*/ 664 626 if (time_after(jiffies, now)) 665 627 break; 666 628 } 667 629 rover = i; 668 - mod_timer(&rt_periodic_timer, now + ip_rt_gc_interval); 630 + mod_timer(&rt_periodic_timer, jiffies + ip_rt_gc_interval); 669 631 } 670 632 671 633 /* This can run from both BH and non-BH contexts, the latter ··· 681 643 get_random_bytes(&rt_hash_rnd, 4); 682 644 683 645 for (i = rt_hash_mask; i >= 0; i--) { 684 - spin_lock_bh(&rt_hash_table[i].lock); 646 + spin_lock_bh(rt_hash_lock_addr(i)); 685 647 rth = rt_hash_table[i].chain; 686 648 if (rth) 687 649 rt_hash_table[i].chain = NULL; 688 - spin_unlock_bh(&rt_hash_table[i].lock); 650 + spin_unlock_bh(rt_hash_lock_addr(i)); 689 651 690 652 for (; rth; rth = next) { 691 653 next = rth->u.rt_next; ··· 818 780 819 781 k = (k + 1) & rt_hash_mask; 820 782 rthp = &rt_hash_table[k].chain; 821 - spin_lock_bh(&rt_hash_table[k].lock); 783 + spin_lock_bh(rt_hash_lock_addr(k)); 822 784 while ((rth = *rthp) != NULL) { 823 785 if (!rt_may_expire(rth, tmo, expire)) { 824 786 tmo >>= 1; ··· 850 812 goal--; 851 813 #endif /* CONFIG_IP_ROUTE_MULTIPATH_CACHED */ 852 814 } 853 - spin_unlock_bh(&rt_hash_table[k].lock); 815 + spin_unlock_bh(rt_hash_lock_addr(k)); 854 816 if (goal <= 0) 855 817 break; 856 818 } ··· 920 882 921 883 rthp = &rt_hash_table[hash].chain; 922 884 923 - spin_lock_bh(&rt_hash_table[hash].lock); 885 + spin_lock_bh(rt_hash_lock_addr(hash)); 924 886 while ((rth = *rthp) != NULL) { 925 887 #ifdef CONFIG_IP_ROUTE_MULTIPATH_CACHED 926 888 if (!(rth->u.dst.flags & DST_BALANCED) && ··· 946 908 rth->u.dst.__use++; 947 909 dst_hold(&rth->u.dst); 948 910 rth->u.dst.lastuse = now; 949 - spin_unlock_bh(&rt_hash_table[hash].lock); 911 + spin_unlock_bh(rt_hash_lock_addr(hash)); 950 912 951 913 rt_drop(rt); 952 914 *rp = rth; ··· 987 949 if (rt->rt_type == RTN_UNICAST || rt->fl.iif == 0) { 988 950 int err = arp_bind_neighbour(&rt->u.dst); 989 951 if (err) { 990 - spin_unlock_bh(&rt_hash_table[hash].lock); 952 + 
spin_unlock_bh(rt_hash_lock_addr(hash)); 991 953 992 954 if (err != -ENOBUFS) { 993 955 rt_drop(rt); ··· 1028 990 } 1029 991 #endif 1030 992 rt_hash_table[hash].chain = rt; 1031 - spin_unlock_bh(&rt_hash_table[hash].lock); 993 + spin_unlock_bh(rt_hash_lock_addr(hash)); 1032 994 *rp = rt; 1033 995 return 0; 1034 996 } ··· 1096 1058 { 1097 1059 struct rtable **rthp; 1098 1060 1099 - spin_lock_bh(&rt_hash_table[hash].lock); 1061 + spin_lock_bh(rt_hash_lock_addr(hash)); 1100 1062 ip_rt_put(rt); 1101 1063 for (rthp = &rt_hash_table[hash].chain; *rthp; 1102 1064 rthp = &(*rthp)->u.rt_next) ··· 1105 1067 rt_free(rt); 1106 1068 break; 1107 1069 } 1108 - spin_unlock_bh(&rt_hash_table[hash].lock); 1070 + spin_unlock_bh(rt_hash_lock_addr(hash)); 1109 1071 } 1110 1072 1111 1073 void ip_rt_redirect(u32 old_gw, u32 daddr, u32 new_gw, ··· 3111 3073 3112 3074 int __init ip_rt_init(void) 3113 3075 { 3114 - int i, order, goal, rc = 0; 3076 + int rc = 0; 3115 3077 3116 3078 rt_hash_rnd = (int) ((num_physpages ^ (num_physpages>>8)) ^ 3117 3079 (jiffies ^ (jiffies >> 7))); 3118 3080 3119 3081 #ifdef CONFIG_NET_CLS_ROUTE 3082 + { 3083 + int order; 3120 3084 for (order = 0; 3121 3085 (PAGE_SIZE << order) < 256 * sizeof(struct ip_rt_acct) * NR_CPUS; order++) 3122 3086 /* NOTHING */; ··· 3126 3086 if (!ip_rt_acct) 3127 3087 panic("IP: failed to allocate ip_rt_acct\n"); 3128 3088 memset(ip_rt_acct, 0, PAGE_SIZE << order); 3089 + } 3129 3090 #endif 3130 3091 3131 3092 ipv4_dst_ops.kmem_cachep = kmem_cache_create("ip_dst_cache", ··· 3137 3096 if (!ipv4_dst_ops.kmem_cachep) 3138 3097 panic("IP: failed to allocate ip_dst_cache\n"); 3139 3098 3140 - goal = num_physpages >> (26 - PAGE_SHIFT); 3141 - if (rhash_entries) 3142 - goal = (rhash_entries * sizeof(struct rt_hash_bucket)) >> PAGE_SHIFT; 3143 - for (order = 0; (1UL << order) < goal; order++) 3144 - /* NOTHING */; 3145 - 3146 - do { 3147 - rt_hash_mask = (1UL << order) * PAGE_SIZE / 3148 - sizeof(struct rt_hash_bucket); 3149 - while 
(rt_hash_mask & (rt_hash_mask - 1)) 3150 - rt_hash_mask--; 3151 - rt_hash_table = (struct rt_hash_bucket *) 3152 - __get_free_pages(GFP_ATOMIC, order); 3153 - } while (rt_hash_table == NULL && --order > 0); 3154 - 3155 - if (!rt_hash_table) 3156 - panic("Failed to allocate IP route cache hash table\n"); 3157 - 3158 - printk(KERN_INFO "IP: routing cache hash table of %u buckets, %ldKbytes\n", 3159 - rt_hash_mask, 3160 - (long) (rt_hash_mask * sizeof(struct rt_hash_bucket)) / 1024); 3161 - 3162 - for (rt_hash_log = 0; (1 << rt_hash_log) != rt_hash_mask; rt_hash_log++) 3163 - /* NOTHING */; 3164 - 3165 - rt_hash_mask--; 3166 - for (i = 0; i <= rt_hash_mask; i++) { 3167 - spin_lock_init(&rt_hash_table[i].lock); 3168 - rt_hash_table[i].chain = NULL; 3169 - } 3099 + rt_hash_table = (struct rt_hash_bucket *) 3100 + alloc_large_system_hash("IP route cache", 3101 + sizeof(struct rt_hash_bucket), 3102 + rhash_entries, 3103 + (num_physpages >= 128 * 1024) ? 3104 + (27 - PAGE_SHIFT) : 3105 + (29 - PAGE_SHIFT), 3106 + HASH_HIGHMEM, 3107 + &rt_hash_log, 3108 + &rt_hash_mask, 3109 + 0); 3110 + memset(rt_hash_table, 0, (rt_hash_mask + 1) * sizeof(struct rt_hash_bucket)); 3111 + rt_hash_lock_init(); 3170 3112 3171 3113 ipv4_dst_ops.gc_thresh = (rt_hash_mask + 1); 3172 3114 ip_rt_max_size = (rt_hash_mask + 1) * 16;
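
The route.c hunks above replace the per-bucket spinlock with a smaller, power-of-two table of locks, and rework how `rt_check_expire()` decides how many buckets to scan. The arithmetic of both can be sketched in plain C as follows. `TOY_LOCK_SZ` and the `toy_*` functions are hypothetical names; plain 64-bit division stands in for the kernel's `do_div()`, and `hash_log` is assumed to be below 32.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_LOCK_SZ 256u   /* must be a power of two, like RT_HASH_LOCK_SZ */

/* Index of the lock that guards hash bucket `slot`. */
static unsigned int toy_lock_index(unsigned int slot)
{
    return slot & (TOY_LOCK_SZ - 1);
}

/*
 * Buckets to scan per timer run: interval * table_size / timeout,
 * computed in 64 bits and clamped to at most one full pass.
 */
static unsigned int toy_scan_goal(unsigned int gc_interval,
                                  unsigned int gc_timeout,
                                  unsigned int hash_log)
{
    uint64_t mult = (uint64_t)gc_interval << hash_log;
    unsigned int hash_mask = (1u << hash_log) - 1;
    unsigned int goal;

    if (gc_timeout > 1)
        mult /= gc_timeout;
    goal = (unsigned int)mult;
    if (goal > hash_mask)
        goal = hash_mask + 1;
    return goal;
}
```

Several buckets share one lock, which trades a little contention for a hash bucket that holds only the chain pointer; the 64-bit intermediate avoids the overflow that a shifted `int` could hit on large tables.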

+24 -20

net/ipv4/tcp.c

··· 615 615 size_t psize, int flags) 616 616 { 617 617 struct tcp_sock *tp = tcp_sk(sk); 618 - int mss_now; 618 + int mss_now, size_goal; 619 619 int err; 620 620 ssize_t copied; 621 621 long timeo = sock_sndtimeo(sk, flags & MSG_DONTWAIT); ··· 628 628 clear_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags); 629 629 630 630 mss_now = tcp_current_mss(sk, !(flags&MSG_OOB)); 631 + size_goal = tp->xmit_size_goal; 631 632 copied = 0; 632 633 633 634 err = -EPIPE; ··· 642 641 int offset = poffset % PAGE_SIZE; 643 642 int size = min_t(size_t, psize, PAGE_SIZE - offset); 644 643 645 - if (!sk->sk_send_head || (copy = mss_now - skb->len) <= 0) { 644 + if (!sk->sk_send_head || (copy = size_goal - skb->len) <= 0) { 646 645 new_segment: 647 646 if (!sk_stream_memory_free(sk)) 648 647 goto wait_for_sndbuf; ··· 653 652 goto wait_for_memory; 654 653 655 654 skb_entail(sk, tp, skb); 656 - copy = mss_now; 655 + copy = size_goal; 657 656 } 658 657 659 658 if (copy > size) ··· 694 693 if (!(psize -= copy)) 695 694 goto out; 696 695 697 - if (skb->len != mss_now || (flags & MSG_OOB)) 696 + if (skb->len < mss_now || (flags & MSG_OOB)) 698 697 continue; 699 698 700 699 if (forced_push(tp)) { ··· 714 713 goto do_error; 715 714 716 715 mss_now = tcp_current_mss(sk, !(flags&MSG_OOB)); 716 + size_goal = tp->xmit_size_goal; 717 717 } 718 718 719 719 out: ··· 756 754 757 755 static inline int select_size(struct sock *sk, struct tcp_sock *tp) 758 756 { 759 - int tmp = tp->mss_cache_std; 757 + int tmp = tp->mss_cache; 760 758 761 759 if (sk->sk_route_caps & NETIF_F_SG) { 762 - int pgbreak = SKB_MAX_HEAD(MAX_TCP_HEADER); 760 + if (sk->sk_route_caps & NETIF_F_TSO) 761 + tmp = 0; 762 + else { 763 + int pgbreak = SKB_MAX_HEAD(MAX_TCP_HEADER); 763 764 764 - if (tmp >= pgbreak && 765 - tmp <= pgbreak + (MAX_SKB_FRAGS - 1) * PAGE_SIZE) 766 - tmp = pgbreak; 765 + if (tmp >= pgbreak && 766 + tmp <= pgbreak + (MAX_SKB_FRAGS - 1) * PAGE_SIZE) 767 + tmp = pgbreak; 768 + } 767 769 } 770 + 768 771 return tmp; 
769 772 } 770 773 ··· 780 773 struct tcp_sock *tp = tcp_sk(sk); 781 774 struct sk_buff *skb; 782 775 int iovlen, flags; 783 - int mss_now; 776 + int mss_now, size_goal; 784 777 int err, copied; 785 778 long timeo; 786 779 ··· 799 792 clear_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags); 800 793 801 794 mss_now = tcp_current_mss(sk, !(flags&MSG_OOB)); 795 + size_goal = tp->xmit_size_goal; 802 796 803 797 /* Ok commence sending. */ 804 798 iovlen = msg->msg_iovlen; ··· 822 814 skb = sk->sk_write_queue.prev; 823 815 824 816 if (!sk->sk_send_head || 825 - (copy = mss_now - skb->len) <= 0) { 817 + (copy = size_goal - skb->len) <= 0) { 826 818 827 819 new_segment: 828 820 /* Allocate new segment. If the interface is SG, ··· 845 837 skb->ip_summed = CHECKSUM_HW; 846 838 847 839 skb_entail(sk, tp, skb); 848 - copy = mss_now; 840 + copy = size_goal; 849 841 } 850 842 851 843 /* Try to append data to the end of skb. */ ··· 880 872 tcp_mark_push(tp, skb); 881 873 goto new_segment; 882 874 } else if (page) { 883 - /* If page is cached, align 884 - * offset to L1 cache boundary 885 - */ 886 - off = (off + L1_CACHE_BYTES - 1) & 887 - ~(L1_CACHE_BYTES - 1); 888 875 if (off == PAGE_SIZE) { 889 876 put_page(page); 890 877 TCP_PAGE(sk) = page = NULL; ··· 940 937 if ((seglen -= copy) == 0 && iovlen == 0) 941 938 goto out; 942 939 943 - if (skb->len != mss_now || (flags & MSG_OOB)) 940 + if (skb->len < mss_now || (flags & MSG_OOB)) 944 941 continue; 945 942 946 943 if (forced_push(tp)) { ··· 960 957 goto do_error; 961 958 962 959 mss_now = tcp_current_mss(sk, !(flags&MSG_OOB)); 960 + size_goal = tp->xmit_size_goal; 963 961 } 964 962 } 965 963 ··· 2132 2128 2133 2129 info->tcpi_rto = jiffies_to_usecs(tp->rto); 2134 2130 info->tcpi_ato = jiffies_to_usecs(tp->ack.ato); 2135 - info->tcpi_snd_mss = tp->mss_cache_std; 2131 + info->tcpi_snd_mss = tp->mss_cache; 2136 2132 info->tcpi_rcv_mss = tp->ack.rcv_mss; 2137 2133 2138 2134 info->tcpi_unacked = tp->packets_out; ··· 2182 2178 2183 2179 
switch (optname) { 2184 2180 case TCP_MAXSEG: 2185 - val = tp->mss_cache_std; 2181 + val = tp->mss_cache; 2186 2182 if (!val && ((1 << sk->sk_state) & (TCPF_CLOSE | TCPF_LISTEN))) 2187 2183 val = tp->rx_opt.user_mss; 2188 2184 break;

+37 -39

net/ipv4/tcp_input.c

··· 740 740 __u32 cwnd = (dst ? dst_metric(dst, RTAX_INITCWND) : 0); 741 741 742 742 if (!cwnd) { 743 - if (tp->mss_cache_std > 1460) 743 + if (tp->mss_cache > 1460) 744 744 cwnd = 2; 745 745 else 746 - cwnd = (tp->mss_cache_std > 1095) ? 3 : 4; 746 + cwnd = (tp->mss_cache > 1095) ? 3 : 4; 747 747 } 748 748 return min_t(__u32, cwnd, tp->snd_cwnd_clamp); 749 749 } ··· 914 914 if (sk->sk_route_caps & NETIF_F_TSO) { 915 915 sk->sk_route_caps &= ~NETIF_F_TSO; 916 916 sock_set_flag(sk, SOCK_NO_LARGESEND); 917 - tp->mss_cache = tp->mss_cache_std; 917 + tp->mss_cache = tp->mss_cache; 918 918 } 919 919 920 920 if (!tp->sacked_out) ··· 1077 1077 (IsFack(tp) || 1078 1078 !before(lost_retrans, 1079 1079 TCP_SKB_CB(skb)->ack_seq + tp->reordering * 1080 - tp->mss_cache_std))) { 1080 + tp->mss_cache))) { 1081 1081 TCP_SKB_CB(skb)->sacked &= ~TCPCB_SACKED_RETRANS; 1082 1082 tp->retrans_out -= tcp_skb_pcount(skb); 1083 1083 ··· 1957 1957 } 1958 1958 } 1959 1959 1960 - /* There is one downside to this scheme. Although we keep the 1961 - * ACK clock ticking, adjusting packet counters and advancing 1962 - * congestion window, we do not liberate socket send buffer 1963 - * space. 1964 - * 1965 - * Mucking with skb->truesize and sk->sk_wmem_alloc et al. 1966 - * then making a write space wakeup callback is a possible 1967 - * future enhancement. WARNING: it is not trivial to make. 1968 - */ 1969 1960 static int tcp_tso_acked(struct sock *sk, struct sk_buff *skb, 1970 1961 __u32 now, __s32 *seq_rtt) 1971 1962 { ··· 2038 2047 * the other end. 
2039 2048 */ 2040 2049 if (after(scb->end_seq, tp->snd_una)) { 2041 - if (tcp_skb_pcount(skb) > 1) 2050 + if (tcp_skb_pcount(skb) > 1 && 2051 + after(tp->snd_una, scb->seq)) 2042 2052 acked |= tcp_tso_acked(sk, skb, 2043 2053 now, &seq_rtt); 2044 2054 break; ··· 3300 3308 tp->snd_cwnd_stamp = tcp_time_stamp; 3301 3309 } 3302 3310 3311 + static inline int tcp_should_expand_sndbuf(struct sock *sk, struct tcp_sock *tp) 3312 + { 3313 + /* If the user specified a specific send buffer setting, do 3314 + * not modify it. 3315 + */ 3316 + if (sk->sk_userlocks & SOCK_SNDBUF_LOCK) 3317 + return 0; 3318 + 3319 + /* If we are under global TCP memory pressure, do not expand. */ 3320 + if (tcp_memory_pressure) 3321 + return 0; 3322 + 3323 + /* If we are under soft global TCP memory pressure, do not expand. */ 3324 + if (atomic_read(&tcp_memory_allocated) >= sysctl_tcp_mem[0]) 3325 + return 0; 3326 + 3327 + /* If we filled the congestion window, do not expand. */ 3328 + if (tp->packets_out >= tp->snd_cwnd) 3329 + return 0; 3330 + 3331 + return 1; 3332 + } 3303 3333 3304 3334 /* When incoming ACK allowed to free some skb from write_queue, 3305 3335 * we remember this event in flag SOCK_QUEUE_SHRUNK and wake up socket ··· 3333 3319 { 3334 3320 struct tcp_sock *tp = tcp_sk(sk); 3335 3321 3336 - if (tp->packets_out < tp->snd_cwnd && 3337 - !(sk->sk_userlocks & SOCK_SNDBUF_LOCK) && 3338 - !tcp_memory_pressure && 3339 - atomic_read(&tcp_memory_allocated) < sysctl_tcp_mem[0]) { 3340 - int sndmem = max_t(u32, tp->rx_opt.mss_clamp, tp->mss_cache_std) + 3322 + if (tcp_should_expand_sndbuf(sk, tp)) { 3323 + int sndmem = max_t(u32, tp->rx_opt.mss_clamp, tp->mss_cache) + 3341 3324 MAX_TCP_HEADER + 16 + sizeof(struct sk_buff), 3342 3325 demanded = max_t(unsigned int, tp->snd_cwnd, 3343 3326 tp->reordering + 1); ··· 3357 3346 } 3358 3347 } 3359 3348 3360 - static void __tcp_data_snd_check(struct sock *sk, struct sk_buff *skb) 3349 + static __inline__ void tcp_data_snd_check(struct sock *sk, 
struct tcp_sock *tp) 3361 3350 { 3362 - struct tcp_sock *tp = tcp_sk(sk); 3363 - 3364 - if (after(TCP_SKB_CB(skb)->end_seq, tp->snd_una + tp->snd_wnd) || 3365 - tcp_packets_in_flight(tp) >= tp->snd_cwnd || 3366 - tcp_write_xmit(sk, tp->nonagle)) 3367 - tcp_check_probe_timer(sk, tp); 3368 - } 3369 - 3370 - static __inline__ void tcp_data_snd_check(struct sock *sk) 3371 - { 3372 - struct sk_buff *skb = sk->sk_send_head; 3373 - 3374 - if (skb != NULL) 3375 - __tcp_data_snd_check(sk, skb); 3351 + tcp_push_pending_frames(sk, tp); 3376 3352 tcp_check_space(sk); 3377 3353 } 3378 3354 ··· 3653 3655 */ 3654 3656 tcp_ack(sk, skb, 0); 3655 3657 __kfree_skb(skb); 3656 - tcp_data_snd_check(sk); 3658 + tcp_data_snd_check(sk, tp); 3657 3659 return 0; 3658 3660 } else { /* Header too small */ 3659 3661 TCP_INC_STATS_BH(TCP_MIB_INERRS); ··· 3719 3721 if (TCP_SKB_CB(skb)->ack_seq != tp->snd_una) { 3720 3722 /* Well, only one small jumplet in fast path... */ 3721 3723 tcp_ack(sk, skb, FLAG_DATA); 3722 - tcp_data_snd_check(sk); 3724 + tcp_data_snd_check(sk, tp); 3723 3725 if (!tcp_ack_scheduled(tp)) 3724 3726 goto no_ack; 3725 3727 } ··· 3797 3799 /* step 7: process the segment text */ 3798 3800 tcp_data_queue(sk, skb); 3799 3801 3800 - tcp_data_snd_check(sk); 3802 + tcp_data_snd_check(sk, tp); 3801 3803 tcp_ack_snd_check(sk); 3802 3804 return 0; 3803 3805 ··· 4107 4109 /* Do step6 onward by hand. */ 4108 4110 tcp_urg(sk, skb, th); 4109 4111 __kfree_skb(skb); 4110 - tcp_data_snd_check(sk); 4112 + tcp_data_snd_check(sk, tp); 4111 4113 return 0; 4112 4114 } 4113 4115 ··· 4298 4300 4299 4301 /* tcp_data could move socket to TIME-WAIT */ 4300 4302 if (sk->sk_state != TCP_CLOSE) { 4301 - tcp_data_snd_check(sk); 4303 + tcp_data_snd_check(sk, tp); 4302 4304 tcp_ack_snd_check(sk); 4303 4305 } 4304 4306

+1 -1

net/ipv4/tcp_ipv4.c

··· 2045 2045 */ 2046 2046 tp->snd_ssthresh = 0x7fffffff; /* Infinity */ 2047 2047 tp->snd_cwnd_clamp = ~0; 2048 - tp->mss_cache_std = tp->mss_cache = 536; 2048 + tp->mss_cache = 536; 2049 2049 2050 2050 tp->reordering = sysctl_tcp_reordering; 2051 2051 tp->ca_ops = &tcp_init_congestion_ops;

+443 -117

net/ipv4/tcp_output.c

··· 49 49 * will allow a single TSO frame to consume. Building TSO frames 50 50 * which are too large can cause TCP streams to be bursty. 51 51 */ 52 - int sysctl_tcp_tso_win_divisor = 8; 52 + int sysctl_tcp_tso_win_divisor = 3; 53 53 54 54 static inline void update_send_head(struct sock *sk, struct tcp_sock *tp, 55 55 struct sk_buff *skb) ··· 140 140 tp->ack.pingpong = 1; 141 141 } 142 142 143 - static __inline__ void tcp_event_ack_sent(struct sock *sk) 143 + static __inline__ void tcp_event_ack_sent(struct sock *sk, unsigned int pkts) 144 144 { 145 145 struct tcp_sock *tp = tcp_sk(sk); 146 146 147 - tcp_dec_quickack_mode(tp); 147 + tcp_dec_quickack_mode(tp, pkts); 148 148 tcp_clear_xmit_timer(sk, TCP_TIME_DACK); 149 149 } 150 150 ··· 355 355 tp->af_specific->send_check(sk, th, skb->len, skb); 356 356 357 357 if (tcb->flags & TCPCB_FLAG_ACK) 358 - tcp_event_ack_sent(sk); 358 + tcp_event_ack_sent(sk, tcp_skb_pcount(skb)); 359 359 360 360 if (skb->len != tcp_header_size) 361 361 tcp_event_data_sent(tp, skb, sk); ··· 403 403 sk->sk_send_head = skb; 404 404 } 405 405 406 - static inline void tcp_tso_set_push(struct sk_buff *skb) 407 - { 408 - /* Force push to be on for any TSO frames to workaround 409 - * problems with busted implementations like Mac OS-X that 410 - * hold off socket receive wakeups until push is seen. 411 - */ 412 - if (tcp_skb_pcount(skb) > 1) 413 - TCP_SKB_CB(skb)->flags |= TCPCB_FLAG_PSH; 414 - } 415 - 416 - /* Send _single_ skb sitting at the send head. This function requires 417 - * true push pending frames to setup probe timer etc. 418 - */ 419 - void tcp_push_one(struct sock *sk, unsigned cur_mss) 420 - { 421 - struct tcp_sock *tp = tcp_sk(sk); 422 - struct sk_buff *skb = sk->sk_send_head; 423 - 424 - if (tcp_snd_test(sk, skb, cur_mss, TCP_NAGLE_PUSH)) { 425 - /* Send it out now. 
*/ 426 - TCP_SKB_CB(skb)->when = tcp_time_stamp; 427 - tcp_tso_set_push(skb); 428 - if (!tcp_transmit_skb(sk, skb_clone(skb, sk->sk_allocation))) { 429 - sk->sk_send_head = NULL; 430 - tp->snd_nxt = TCP_SKB_CB(skb)->end_seq; 431 - tcp_packets_out_inc(sk, tp, skb); 432 - return; 433 - } 434 - } 435 - } 436 - 437 - void tcp_set_skb_tso_segs(struct sock *sk, struct sk_buff *skb) 406 + static void tcp_set_skb_tso_segs(struct sock *sk, struct sk_buff *skb) 438 407 { 439 408 struct tcp_sock *tp = tcp_sk(sk); 440 409 441 - if (skb->len <= tp->mss_cache_std || 410 + if (skb->len <= tp->mss_cache || 442 411 !(sk->sk_route_caps & NETIF_F_TSO)) { 443 412 /* Avoid the costly divide in the normal 444 413 * non-TSO case. ··· 417 448 } else { 418 449 unsigned int factor; 419 450 420 - factor = skb->len + (tp->mss_cache_std - 1); 421 - factor /= tp->mss_cache_std; 451 + factor = skb->len + (tp->mss_cache - 1); 452 + factor /= tp->mss_cache; 422 453 skb_shinfo(skb)->tso_segs = factor; 423 - skb_shinfo(skb)->tso_size = tp->mss_cache_std; 454 + skb_shinfo(skb)->tso_size = tp->mss_cache; 424 455 } 425 456 } 426 457 ··· 506 537 } 507 538 508 539 /* Link BUFF into the send queue. */ 540 + skb_header_release(buff); 509 541 __skb_append(skb, buff); 510 542 511 543 return 0; ··· 627 657 628 658 /* And store cached results */ 629 659 tp->pmtu_cookie = pmtu; 630 - tp->mss_cache = tp->mss_cache_std = mss_now; 660 + tp->mss_cache = mss_now; 631 661 632 662 return mss_now; 633 663 } ··· 639 669 * cannot be large. However, taking into account rare use of URG, this 640 670 * is not a big flaw. 
641 671 */ 642 - 643 - unsigned int tcp_current_mss(struct sock *sk, int large) 672 + unsigned int tcp_current_mss(struct sock *sk, int large_allowed) 644 673 { 645 674 struct tcp_sock *tp = tcp_sk(sk); 646 675 struct dst_entry *dst = __sk_dst_get(sk); 647 - unsigned int do_large, mss_now; 676 + u32 mss_now; 677 + u16 xmit_size_goal; 678 + int doing_tso = 0; 648 679 649 - mss_now = tp->mss_cache_std; 680 + mss_now = tp->mss_cache; 681 + 682 + if (large_allowed && 683 + (sk->sk_route_caps & NETIF_F_TSO) && 684 + !tp->urg_mode) 685 + doing_tso = 1; 686 + 650 687 if (dst) { 651 688 u32 mtu = dst_mtu(dst); 652 689 if (mtu != tp->pmtu_cookie) 653 690 mss_now = tcp_sync_mss(sk, mtu); 654 691 } 655 692 656 - do_large = (large && 657 - (sk->sk_route_caps & NETIF_F_TSO) && 658 - !tp->urg_mode); 659 - 660 - if (do_large) { 661 - unsigned int large_mss, factor, limit; 662 - 663 - large_mss = 65535 - tp->af_specific->net_header_len - 664 - tp->ext_header_len - tp->tcp_header_len; 665 - 666 - if (tp->max_window && large_mss > (tp->max_window>>1)) 667 - large_mss = max((tp->max_window>>1), 668 - 68U - tp->tcp_header_len); 669 - 670 - factor = large_mss / mss_now; 671 - 672 - /* Always keep large mss multiple of real mss, but 673 - * do not exceed 1/tso_win_divisor of the congestion window 674 - * so we can keep the ACK clock ticking and minimize 675 - * bursting. 
676 - */ 677 - limit = tp->snd_cwnd; 678 - if (sysctl_tcp_tso_win_divisor) 679 - limit /= sysctl_tcp_tso_win_divisor; 680 - limit = max(1U, limit); 681 - if (factor > limit) 682 - factor = limit; 683 - 684 - tp->mss_cache = mss_now * factor; 685 - 686 - mss_now = tp->mss_cache; 687 - } 688 - 689 693 if (tp->rx_opt.eff_sacks) 690 694 mss_now -= (TCPOLEN_SACK_BASE_ALIGNED + 691 695 (tp->rx_opt.eff_sacks * TCPOLEN_SACK_PERBLOCK)); 696 + 697 + xmit_size_goal = mss_now; 698 + 699 + if (doing_tso) { 700 + xmit_size_goal = 65535 - 701 + tp->af_specific->net_header_len - 702 + tp->ext_header_len - tp->tcp_header_len; 703 + 704 + if (tp->max_window && 705 + (xmit_size_goal > (tp->max_window >> 1))) 706 + xmit_size_goal = max((tp->max_window >> 1), 707 + 68U - tp->tcp_header_len); 708 + 709 + xmit_size_goal -= (xmit_size_goal % mss_now); 710 + } 711 + tp->xmit_size_goal = xmit_size_goal; 712 + 692 713 return mss_now; 714 + } 715 + 716 + /* Congestion window validation. (RFC2861) */ 717 + 718 + static inline void tcp_cwnd_validate(struct sock *sk, struct tcp_sock *tp) 719 + { 720 + __u32 packets_out = tp->packets_out; 721 + 722 + if (packets_out >= tp->snd_cwnd) { 723 + /* Network is feed fully. */ 724 + tp->snd_cwnd_used = 0; 725 + tp->snd_cwnd_stamp = tcp_time_stamp; 726 + } else { 727 + /* Network starves. */ 728 + if (tp->packets_out > tp->snd_cwnd_used) 729 + tp->snd_cwnd_used = tp->packets_out; 730 + 731 + if ((s32)(tcp_time_stamp - tp->snd_cwnd_stamp) >= tp->rto) 732 + tcp_cwnd_application_limited(sk); 733 + } 734 + } 735 + 736 + static unsigned int tcp_window_allows(struct tcp_sock *tp, struct sk_buff *skb, unsigned int mss_now, unsigned int cwnd) 737 + { 738 + u32 window, cwnd_len; 739 + 740 + window = (tp->snd_una + tp->snd_wnd - TCP_SKB_CB(skb)->seq); 741 + cwnd_len = mss_now * cwnd; 742 + return min(window, cwnd_len); 743 + } 744 + 745 + /* Can at least one segment of SKB be sent right now, according to the 746 + * congestion window rules? 
If so, return how many segments are allowed. 747 + */ 748 + static inline unsigned int tcp_cwnd_test(struct tcp_sock *tp, struct sk_buff *skb) 749 + { 750 + u32 in_flight, cwnd; 751 + 752 + /* Don't be strict about the congestion window for the final FIN. */ 753 + if (TCP_SKB_CB(skb)->flags & TCPCB_FLAG_FIN) 754 + return 1; 755 + 756 + in_flight = tcp_packets_in_flight(tp); 757 + cwnd = tp->snd_cwnd; 758 + if (in_flight < cwnd) 759 + return (cwnd - in_flight); 760 + 761 + return 0; 762 + } 763 + 764 + /* This must be invoked the first time we consider transmitting 765 + * SKB onto the wire. 766 + */ 767 + static inline int tcp_init_tso_segs(struct sock *sk, struct sk_buff *skb) 768 + { 769 + int tso_segs = tcp_skb_pcount(skb); 770 + 771 + if (!tso_segs) { 772 + tcp_set_skb_tso_segs(sk, skb); 773 + tso_segs = tcp_skb_pcount(skb); 774 + } 775 + return tso_segs; 776 + } 777 + 778 + static inline int tcp_minshall_check(const struct tcp_sock *tp) 779 + { 780 + return after(tp->snd_sml,tp->snd_una) && 781 + !after(tp->snd_sml, tp->snd_nxt); 782 + } 783 + 784 + /* Return 0, if packet can be sent now without violation Nagle's rules: 785 + * 1. It is full sized. 786 + * 2. Or it contains FIN. (already checked by caller) 787 + * 3. Or TCP_NODELAY was set. 788 + * 4. Or TCP_CORK is not set, and all sent packets are ACKed. 789 + * With Minshall's modification: all sent small packets are ACKed. 790 + */ 791 + 792 + static inline int tcp_nagle_check(const struct tcp_sock *tp, 793 + const struct sk_buff *skb, 794 + unsigned mss_now, int nonagle) 795 + { 796 + return (skb->len < mss_now && 797 + ((nonagle&TCP_NAGLE_CORK) || 798 + (!nonagle && 799 + tp->packets_out && 800 + tcp_minshall_check(tp)))); 801 + } 802 + 803 + /* Return non-zero if the Nagle test allows this packet to be 804 + * sent now. 
805 + */ 806 + static inline int tcp_nagle_test(struct tcp_sock *tp, struct sk_buff *skb, 807 + unsigned int cur_mss, int nonagle) 808 + { 809 + /* Nagle rule does not apply to frames, which sit in the middle of the 810 + * write_queue (they have no chances to get new data). 811 + * 812 + * This is implemented in the callers, where they modify the 'nonagle' 813 + * argument based upon the location of SKB in the send queue. 814 + */ 815 + if (nonagle & TCP_NAGLE_PUSH) 816 + return 1; 817 + 818 + /* Don't use the nagle rule for urgent data (or for the final FIN). */ 819 + if (tp->urg_mode || 820 + (TCP_SKB_CB(skb)->flags & TCPCB_FLAG_FIN)) 821 + return 1; 822 + 823 + if (!tcp_nagle_check(tp, skb, cur_mss, nonagle)) 824 + return 1; 825 + 826 + return 0; 827 + } 828 + 829 + /* Does at least the first segment of SKB fit into the send window? */ 830 + static inline int tcp_snd_wnd_test(struct tcp_sock *tp, struct sk_buff *skb, unsigned int cur_mss) 831 + { 832 + u32 end_seq = TCP_SKB_CB(skb)->end_seq; 833 + 834 + if (skb->len > cur_mss) 835 + end_seq = TCP_SKB_CB(skb)->seq + cur_mss; 836 + 837 + return !after(end_seq, tp->snd_una + tp->snd_wnd); 838 + } 839 + 840 + /* This checks if the data bearing packet SKB (usually sk->sk_send_head) 841 + * should be put on the wire right now. If so, it returns the number of 842 + * packets allowed by the congestion window. 
843 + */ 844 + static unsigned int tcp_snd_test(struct sock *sk, struct sk_buff *skb, 845 + unsigned int cur_mss, int nonagle) 846 + { 847 + struct tcp_sock *tp = tcp_sk(sk); 848 + unsigned int cwnd_quota; 849 + 850 + tcp_init_tso_segs(sk, skb); 851 + 852 + if (!tcp_nagle_test(tp, skb, cur_mss, nonagle)) 853 + return 0; 854 + 855 + cwnd_quota = tcp_cwnd_test(tp, skb); 856 + if (cwnd_quota && 857 + !tcp_snd_wnd_test(tp, skb, cur_mss)) 858 + cwnd_quota = 0; 859 + 860 + return cwnd_quota; 861 + } 862 + 863 + static inline int tcp_skb_is_last(const struct sock *sk, 864 + const struct sk_buff *skb) 865 + { 866 + return skb->next == (struct sk_buff *)&sk->sk_write_queue; 867 + } 868 + 869 + int tcp_may_send_now(struct sock *sk, struct tcp_sock *tp) 870 + { 871 + struct sk_buff *skb = sk->sk_send_head; 872 + 873 + return (skb && 874 + tcp_snd_test(sk, skb, tcp_current_mss(sk, 1), 875 + (tcp_skb_is_last(sk, skb) ? 876 + TCP_NAGLE_PUSH : 877 + tp->nonagle))); 878 + } 879 + 880 + /* Trim TSO SKB to LEN bytes, put the remaining data into a new packet 881 + * which is put after SKB on the list. It is very much like 882 + * tcp_fragment() except that it may make several kinds of assumptions 883 + * in order to speed up the splitting operation. In particular, we 884 + * know that all the data is in scatter-gather pages, and that the 885 + * packet has never been sent out before (and thus is not cloned). 886 + */ 887 + static int tso_fragment(struct sock *sk, struct sk_buff *skb, unsigned int len) 888 + { 889 + struct sk_buff *buff; 890 + int nlen = skb->len - len; 891 + u16 flags; 892 + 893 + /* All of a TSO frame must be composed of paged data. */ 894 + BUG_ON(skb->len != skb->data_len); 895 + 896 + buff = sk_stream_alloc_pskb(sk, 0, 0, GFP_ATOMIC); 897 + if (unlikely(buff == NULL)) 898 + return -ENOMEM; 899 + 900 + buff->truesize = nlen; 901 + skb->truesize -= nlen; 902 + 903 + /* Correct the sequence numbers. 
*/ 904 + TCP_SKB_CB(buff)->seq = TCP_SKB_CB(skb)->seq + len; 905 + TCP_SKB_CB(buff)->end_seq = TCP_SKB_CB(skb)->end_seq; 906 + TCP_SKB_CB(skb)->end_seq = TCP_SKB_CB(buff)->seq; 907 + 908 + /* PSH and FIN should only be set in the second packet. */ 909 + flags = TCP_SKB_CB(skb)->flags; 910 + TCP_SKB_CB(skb)->flags = flags & ~(TCPCB_FLAG_FIN|TCPCB_FLAG_PSH); 911 + TCP_SKB_CB(buff)->flags = flags; 912 + 913 + /* This packet was never sent out yet, so no SACK bits. */ 914 + TCP_SKB_CB(buff)->sacked = 0; 915 + 916 + buff->ip_summed = skb->ip_summed = CHECKSUM_HW; 917 + skb_split(skb, buff, len); 918 + 919 + /* Fix up tso_factor for both original and new SKB. */ 920 + tcp_set_skb_tso_segs(sk, skb); 921 + tcp_set_skb_tso_segs(sk, buff); 922 + 923 + /* Link BUFF into the send queue. */ 924 + skb_header_release(buff); 925 + __skb_append(skb, buff); 926 + 927 + return 0; 928 + } 929 + 930 + /* Try to defer sending, if possible, in order to minimize the amount 931 + * of TSO splitting we do. View it as a kind of TSO Nagle test. 932 + * 933 + * This algorithm is from John Heffner. 934 + */ 935 + static int tcp_tso_should_defer(struct sock *sk, struct tcp_sock *tp, struct sk_buff *skb) 936 + { 937 + u32 send_win, cong_win, limit, in_flight; 938 + 939 + if (TCP_SKB_CB(skb)->flags & TCPCB_FLAG_FIN) 940 + return 0; 941 + 942 + if (tp->ca_state != TCP_CA_Open) 943 + return 0; 944 + 945 + in_flight = tcp_packets_in_flight(tp); 946 + 947 + BUG_ON(tcp_skb_pcount(skb) <= 1 || 948 + (tp->snd_cwnd <= in_flight)); 949 + 950 + send_win = (tp->snd_una + tp->snd_wnd) - TCP_SKB_CB(skb)->seq; 951 + 952 + /* From in_flight test above, we know that cwnd > in_flight. */ 953 + cong_win = (tp->snd_cwnd - in_flight) * tp->mss_cache; 954 + 955 + limit = min(send_win, cong_win); 956 + 957 + /* If sk_send_head can be sent fully now, just do it. 
*/ 958 + if (skb->len <= limit) 959 + return 0; 960 + 961 + if (sysctl_tcp_tso_win_divisor) { 962 + u32 chunk = min(tp->snd_wnd, tp->snd_cwnd * tp->mss_cache); 963 + 964 + /* If at least some fraction of a window is available, 965 + * just use it. 966 + */ 967 + chunk /= sysctl_tcp_tso_win_divisor; 968 + if (limit >= chunk) 969 + return 0; 970 + } else { 971 + /* Different approach, try not to defer past a single 972 + * ACK. Receiver should ACK every other full sized 973 + * frame, so if we have space for more than 3 frames 974 + * then send now. 975 + */ 976 + if (limit > tcp_max_burst(tp) * tp->mss_cache) 977 + return 0; 978 + } 979 + 980 + /* Ok, it looks like it is advisable to defer. */ 981 + return 1; 693 982 } 694 983 695 984 /* This routine writes packets to the network. It advances the ··· 958 729 * Returns 1, if no segments are in flight and we have queued segments, but 959 730 * cannot send anything now because of SWS or another problem. 960 731 */ 961 - int tcp_write_xmit(struct sock *sk, int nonagle) 732 + static int tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle) 962 733 { 963 734 struct tcp_sock *tp = tcp_sk(sk); 964 - unsigned int mss_now; 735 + struct sk_buff *skb; 736 + unsigned int tso_segs, sent_pkts; 737 + int cwnd_quota; 965 738 966 739 /* If we are closed, the bytes will have to remain here. 967 740 * In time closedown will finish, we empty the write queue and all 968 741 * will be happy. 969 742 */ 970 - if (sk->sk_state != TCP_CLOSE) { 971 - struct sk_buff *skb; 972 - int sent_pkts = 0; 743 + if (unlikely(sk->sk_state == TCP_CLOSE)) 744 + return 0; 973 745 974 - /* Account for SACKS, we may need to fragment due to this. 975 - * It is just like the real MSS changing on us midstream. 976 - * We also handle things correctly when the user adds some 977 - * IP options mid-stream. Silly to do, but cover it. 
978 - */ 979 - mss_now = tcp_current_mss(sk, 1); 746 + skb = sk->sk_send_head; 747 + if (unlikely(!skb)) 748 + return 0; 980 749 981 - while ((skb = sk->sk_send_head) && 982 - tcp_snd_test(sk, skb, mss_now, 983 - tcp_skb_is_last(sk, skb) ? nonagle : 984 - TCP_NAGLE_PUSH)) { 985 - if (skb->len > mss_now) { 986 - if (tcp_fragment(sk, skb, mss_now)) 750 + tso_segs = tcp_init_tso_segs(sk, skb); 751 + cwnd_quota = tcp_cwnd_test(tp, skb); 752 + if (unlikely(!cwnd_quota)) 753 + goto out; 754 + 755 + sent_pkts = 0; 756 + while (likely(tcp_snd_wnd_test(tp, skb, mss_now))) { 757 + BUG_ON(!tso_segs); 758 + 759 + if (tso_segs == 1) { 760 + if (unlikely(!tcp_nagle_test(tp, skb, mss_now, 761 + (tcp_skb_is_last(sk, skb) ? 762 + nonagle : TCP_NAGLE_PUSH)))) 763 + break; 764 + } else { 765 + if (tcp_tso_should_defer(sk, tp, skb)) 766 + break; 767 + } 768 + 769 + if (tso_segs > 1) { 770 + u32 limit = tcp_window_allows(tp, skb, 771 + mss_now, cwnd_quota); 772 + 773 + if (skb->len < limit) { 774 + unsigned int trim = skb->len % mss_now; 775 + 776 + if (trim) 777 + limit = skb->len - trim; 778 + } 779 + if (skb->len > limit) { 780 + if (tso_fragment(sk, skb, limit)) 987 781 break; 988 782 } 989 - 990 - TCP_SKB_CB(skb)->when = tcp_time_stamp; 991 - tcp_tso_set_push(skb); 992 - if (tcp_transmit_skb(sk, skb_clone(skb, GFP_ATOMIC))) 783 + } else if (unlikely(skb->len > mss_now)) { 784 + if (unlikely(tcp_fragment(sk, skb, mss_now))) 993 785 break; 994 - 995 - /* Advance the send_head. This one is sent out. 996 - * This call will increment packets_out. 
997 - */ 998 - update_send_head(sk, tp, skb); 999 - 1000 - tcp_minshall_update(tp, mss_now, skb); 1001 - sent_pkts = 1; 1002 786 } 1003 787 1004 - if (sent_pkts) { 1005 - tcp_cwnd_validate(sk, tp); 1006 - return 0; 1007 - } 788 + TCP_SKB_CB(skb)->when = tcp_time_stamp; 1008 789 1009 - return !tp->packets_out && sk->sk_send_head; 790 + if (unlikely(tcp_transmit_skb(sk, skb_clone(skb, GFP_ATOMIC)))) 791 + break; 792 + 793 + /* Advance the send_head. This one is sent out. 794 + * This call will increment packets_out. 795 + */ 796 + update_send_head(sk, tp, skb); 797 + 798 + tcp_minshall_update(tp, mss_now, skb); 799 + sent_pkts++; 800 + 801 + /* Do not optimize this to use tso_segs. If we chopped up 802 + * the packet above, tso_segs will no longer be valid. 803 + */ 804 + cwnd_quota -= tcp_skb_pcount(skb); 805 + 806 + BUG_ON(cwnd_quota < 0); 807 + if (!cwnd_quota) 808 + break; 809 + 810 + skb = sk->sk_send_head; 811 + if (!skb) 812 + break; 813 + tso_segs = tcp_init_tso_segs(sk, skb); 1010 814 } 1011 - return 0; 815 + 816 + if (likely(sent_pkts)) { 817 + tcp_cwnd_validate(sk, tp); 818 + return 0; 819 + } 820 + out: 821 + return !tp->packets_out && sk->sk_send_head; 822 + } 823 + 824 + /* Push out any pending frames which were held back due to 825 + * TCP_CORK or attempt at coalescing tiny packets. 826 + * The socket must be locked by the caller. 827 + */ 828 + void __tcp_push_pending_frames(struct sock *sk, struct tcp_sock *tp, 829 + unsigned int cur_mss, int nonagle) 830 + { 831 + struct sk_buff *skb = sk->sk_send_head; 832 + 833 + if (skb) { 834 + if (tcp_write_xmit(sk, cur_mss, nonagle)) 835 + tcp_check_probe_timer(sk, tp); 836 + } 837 + } 838 + 839 + /* Send _single_ skb sitting at the send head. This function requires 840 + * true push pending frames to setup probe timer etc. 
841 + */ 842 + void tcp_push_one(struct sock *sk, unsigned int mss_now) 843 + { 844 + struct tcp_sock *tp = tcp_sk(sk); 845 + struct sk_buff *skb = sk->sk_send_head; 846 + unsigned int tso_segs, cwnd_quota; 847 + 848 + BUG_ON(!skb || skb->len < mss_now); 849 + 850 + tso_segs = tcp_init_tso_segs(sk, skb); 851 + cwnd_quota = tcp_snd_test(sk, skb, mss_now, TCP_NAGLE_PUSH); 852 + 853 + if (likely(cwnd_quota)) { 854 + BUG_ON(!tso_segs); 855 + 856 + if (tso_segs > 1) { 857 + u32 limit = tcp_window_allows(tp, skb, 858 + mss_now, cwnd_quota); 859 + 860 + if (skb->len < limit) { 861 + unsigned int trim = skb->len % mss_now; 862 + 863 + if (trim) 864 + limit = skb->len - trim; 865 + } 866 + if (skb->len > limit) { 867 + if (unlikely(tso_fragment(sk, skb, limit))) 868 + return; 869 + } 870 + } else if (unlikely(skb->len > mss_now)) { 871 + if (unlikely(tcp_fragment(sk, skb, mss_now))) 872 + return; 873 + } 874 + 875 + /* Send it out now. */ 876 + TCP_SKB_CB(skb)->when = tcp_time_stamp; 877 + 878 + if (likely(!tcp_transmit_skb(sk, skb_clone(skb, sk->sk_allocation)))) { 879 + update_send_head(sk, tp, skb); 880 + tcp_cwnd_validate(sk, tp); 881 + return; 882 + } 883 + } 1012 884 } 1013 885 1014 886 /* This function returns the amount that we can raise the ··· 1369 1039 if (sk->sk_route_caps & NETIF_F_TSO) { 1370 1040 sk->sk_route_caps &= ~NETIF_F_TSO; 1371 1041 sock_set_flag(sk, SOCK_NO_LARGESEND); 1372 - tp->mss_cache = tp->mss_cache_std; 1373 1042 } 1374 1043 1375 1044 if (tcp_trim_head(sk, skb, tp->snd_una - TCP_SKB_CB(skb)->seq)) ··· 1430 1101 * is still in somebody's hands, else make a clone. 1431 1102 */ 1432 1103 TCP_SKB_CB(skb)->when = tcp_time_stamp; 1433 - tcp_tso_set_push(skb); 1434 1104 1435 1105 err = tcp_transmit_skb(sk, (skb_cloned(skb) ? 
1436 1106 pskb_copy(skb, GFP_ATOMIC): ··· 1998 1670 if (sk->sk_route_caps & NETIF_F_TSO) { 1999 1671 sock_set_flag(sk, SOCK_NO_LARGESEND); 2000 1672 sk->sk_route_caps &= ~NETIF_F_TSO; 2001 - tp->mss_cache = tp->mss_cache_std; 2002 1673 } 2003 1674 } else if (!tcp_skb_pcount(skb)) 2004 1675 tcp_set_skb_tso_segs(sk, skb); 2005 1676 2006 1677 TCP_SKB_CB(skb)->flags |= TCPCB_FLAG_PSH; 2007 1678 TCP_SKB_CB(skb)->when = tcp_time_stamp; 2008 - tcp_tso_set_push(skb); 2009 1679 err = tcp_transmit_skb(sk, skb_clone(skb, GFP_ATOMIC)); 2010 1680 if (!err) { 2011 1681 update_send_head(sk, tp, skb);
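The send-path arithmetic in the tcp_output.c hunks above (the segment count in tcp_set_skb_tso_segs(), the byte limit in tcp_window_allows(), and the trim step before tso_fragment()) can be sketched in user space roughly as follows. This is a minimal illustration, not the kernel code: the `sketch_` function names and the plain integer parameters are assumptions made for the example, and the TSO capability check on the route is left out.

```c
#include <stdint.h>

/* Hypothetical user-space helpers; the names are illustrative only. */

/* Number of MSS-sized segments needed to carry len bytes.
 * Frames that fit in one MSS take the fast path and skip the divide,
 * as the non-TSO branch of tcp_set_skb_tso_segs() does. */
static uint32_t sketch_tso_segs(uint32_t len, uint32_t mss)
{
    if (len <= mss)
        return 1;
    return (len + mss - 1) / mss;   /* ceiling division */
}

/* Bytes that may be sent right now: the smaller of the remaining
 * send window and the congestion-window quota expressed in bytes. */
static uint32_t sketch_window_allows(uint32_t send_window, uint32_t mss,
                                     uint32_t cwnd_quota)
{
    uint32_t cwnd_len = mss * cwnd_quota;

    return send_window < cwnd_len ? send_window : cwnd_len;
}

/* Split point for a multi-segment frame. If the whole frame fits under
 * the limit, only a sub-MSS tail (if any) is trimmed off so that the
 * part sent now is a whole number of segments. The caller splits the
 * frame only when skb_len is greater than the returned value. */
static uint32_t sketch_split_len(uint32_t skb_len, uint32_t limit,
                                 uint32_t mss)
{
    if (skb_len < limit) {
        uint32_t trim = skb_len % mss;

        if (trim)
            limit = skb_len - trim;
    }
    return limit;
}
```

With a 1000-byte MSS, a 3500-byte frame under an 8000-byte limit yields a split point of 3000, so three full segments go out and the 500-byte tail stays queued where later data can be appended to it.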

+2 -2

net/ipv6/af_inet6.c

··· 774 774 if (if6_proc_init()) 775 775 goto proc_if6_fail; 776 776 #endif 777 - ipv6_packet_init(); 778 777 ip6_route_init(); 779 778 ip6_flowlabel_init(); 780 779 err = addrconf_init(); ··· 790 791 /* Init v6 transport protocols. */ 791 792 udpv6_init(); 792 793 tcpv6_init(); 794 + 795 + ipv6_packet_init(); 793 796 err = 0; 794 797 out: 795 798 return err; ··· 799 798 addrconf_fail: 800 799 ip6_flowlabel_cleanup(); 801 800 ip6_route_cleanup(); 802 - ipv6_packet_cleanup(); 803 801 #ifdef CONFIG_PROC_FS 804 802 if6_proc_exit(); 805 803 proc_if6_fail:

-1

net/ipv6/ip6_output.c

··· 465 465 to->pkt_type = from->pkt_type; 466 466 to->priority = from->priority; 467 467 to->protocol = from->protocol; 468 - to->security = from->security; 469 468 dst_release(to->dst); 470 469 to->dst = dst_clone(from->dst); 471 470 to->dev = from->dev;

+1 -1

net/ipv6/tcp_ipv6.c

··· 2018 2018 */ 2019 2019 tp->snd_ssthresh = 0x7fffffff; 2020 2020 tp->snd_cwnd_clamp = ~0; 2021 - tp->mss_cache_std = tp->mss_cache = 536; 2021 + tp->mss_cache = 536; 2022 2022 2023 2023 tp->reordering = sysctl_tcp_reordering; 2024 2024

+1 -1

net/sched/Makefile

··· 4 4 5 5 obj-y := sch_generic.o 6 6 7 - obj-$(CONFIG_NET_SCHED) += sch_api.o sch_fifo.o 7 + obj-$(CONFIG_NET_SCHED) += sch_api.o sch_fifo.o sch_blackhole.o 8 8 obj-$(CONFIG_NET_CLS) += cls_api.o 9 9 obj-$(CONFIG_NET_CLS_ACT) += act_api.o 10 10 obj-$(CONFIG_NET_ACT_POLICE) += police.o

-6

net/sched/em_meta.c

··· 205 205 dst->value = skb->protocol; 206 206 } 207 207 208 - META_COLLECTOR(int_security) 209 - { 210 - dst->value = skb->security; 211 - } 212 - 213 208 META_COLLECTOR(int_pkttype) 214 209 { 215 210 dst->value = skb->pkt_type; ··· 519 524 [META_ID(REALDEV)] = META_FUNC(int_realdev), 520 525 [META_ID(PRIORITY)] = META_FUNC(int_priority), 521 526 [META_ID(PROTOCOL)] = META_FUNC(int_protocol), 522 - [META_ID(SECURITY)] = META_FUNC(int_security), 523 527 [META_ID(PKTTYPE)] = META_FUNC(int_pkttype), 524 528 [META_ID(PKTLEN)] = META_FUNC(int_pktlen), 525 529 [META_ID(DATALEN)] = META_FUNC(int_datalen),

+26 -37

net/sched/sch_api.c

··· 399 399 { 400 400 int err; 401 401 struct rtattr *kind = tca[TCA_KIND-1]; 402 - void *p = NULL; 403 402 struct Qdisc *sch; 404 403 struct Qdisc_ops *ops; 405 - int size; 406 404 407 405 ops = qdisc_lookup_ops(kind); 408 406 #ifdef CONFIG_KMOD ··· 435 437 if (ops == NULL) 436 438 goto err_out; 437 439 438 - /* ensure that the Qdisc and the private data are 32-byte aligned */ 439 - size = ((sizeof(*sch) + QDISC_ALIGN_CONST) & ~QDISC_ALIGN_CONST); 440 - size += ops->priv_size + QDISC_ALIGN_CONST; 441 - 442 - p = kmalloc(size, GFP_KERNEL); 443 - err = -ENOBUFS; 444 - if (!p) 440 + sch = qdisc_alloc(dev, ops); 441 + if (IS_ERR(sch)) { 442 + err = PTR_ERR(sch); 445 443 goto err_out2; 446 - memset(p, 0, size); 447 - sch = (struct Qdisc *)(((unsigned long)p + QDISC_ALIGN_CONST) 448 - & ~QDISC_ALIGN_CONST); 449 - sch->padded = (char *)sch - (char *)p; 444 + } 450 445 451 - INIT_LIST_HEAD(&sch->list); 452 - skb_queue_head_init(&sch->q); 453 - 454 - if (handle == TC_H_INGRESS) 446 + if (handle == TC_H_INGRESS) { 455 447 sch->flags |= TCQ_F_INGRESS; 456 - 457 - sch->ops = ops; 458 - sch->enqueue = ops->enqueue; 459 - sch->dequeue = ops->dequeue; 460 - sch->dev = dev; 461 - dev_hold(dev); 462 - atomic_set(&sch->refcnt, 1); 463 - sch->stats_lock = &dev->queue_lock; 464 - if (handle == 0) { 448 + handle = TC_H_MAKE(TC_H_INGRESS, 0); 449 + } else if (handle == 0) { 465 450 handle = qdisc_alloc_handle(dev); 466 451 err = -ENOMEM; 467 452 if (handle == 0) 468 453 goto err_out3; 469 454 } 470 455 471 - if (handle == TC_H_INGRESS) 472 - sch->handle =TC_H_MAKE(TC_H_INGRESS, 0); 473 - else 474 - sch->handle = handle; 456 + sch->handle = handle; 475 457 476 458 if (!ops->init || (err = ops->init(sch, tca[TCA_OPTIONS-1])) == 0) { 459 + #ifdef CONFIG_NET_ESTIMATOR 460 + if (tca[TCA_RATE-1]) { 461 + err = gen_new_estimator(&sch->bstats, &sch->rate_est, 462 + sch->stats_lock, 463 + tca[TCA_RATE-1]); 464 + if (err) { 465 + /* 466 + * Any broken qdiscs that would require 467 + * a 
ops->reset() here? The qdisc was never 468 + * in action so it shouldn't be necessary. 469 + */ 470 + if (ops->destroy) 471 + ops->destroy(sch); 472 + goto err_out3; 473 + } 474 + } 475 + #endif 477 476 qdisc_lock_tree(dev); 478 477 list_add_tail(&sch->list, &dev->qdisc_list); 479 478 qdisc_unlock_tree(dev); 480 479 481 - #ifdef CONFIG_NET_ESTIMATOR 482 - if (tca[TCA_RATE-1]) 483 - gen_new_estimator(&sch->bstats, &sch->rate_est, 484 - sch->stats_lock, tca[TCA_RATE-1]); 485 - #endif 486 480 return sch; 487 481 } 488 482 err_out3: 489 483 dev_put(dev); 484 + kfree((char *) sch - sch->padded); 490 485 err_out2: 491 486 module_put(ops->owner); 492 487 err_out: 493 488 *errp = err; 494 - if (p) 495 - kfree(p); 496 489 return NULL; 497 490 } 498 491
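The open-coded rounding removed in this hunk (and replaced by a call to qdisc_alloc()) is an align-up on a power-of-two boundary. A minimal user-space sketch follows, assuming the 32-byte alignment named in the removed comment; the `SKETCH_ALIGNTO` constant and the `sketch_` helpers are hypothetical names for this example, not the kernel's macros.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed alignment, taken from the "32-byte aligned" comment. */
#define SKETCH_ALIGNTO 32u

/* Round v up to the next multiple of SKETCH_ALIGNTO (a power of two). */
static uintptr_t sketch_align(uintptr_t v)
{
    return (v + (SKETCH_ALIGNTO - 1)) & ~(uintptr_t)(SKETCH_ALIGNTO - 1);
}

/* Bytes to request from the allocator so that an aligned header followed
 * by priv_size bytes of private data fits, wherever the returned block
 * happens to start: the header is padded to the alignment, and up to
 * SKETCH_ALIGNTO - 1 bytes of slack are reserved for moving the start
 * of the header forward to an aligned address. */
static size_t sketch_alloc_size(size_t hdr_size, size_t priv_size)
{
    return (size_t)sketch_align(hdr_size) + priv_size + (SKETCH_ALIGNTO - 1);
}
```

The distance between the allocator's pointer and the aligned header is what the diff stores in `sch->padded`, so the original pointer can be recovered for kfree().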

+54

net/sched/sch_blackhole.c

··· 1 + /* 2 + * net/sched/sch_blackhole.c Black hole queue 3 + * 4 + * This program is free software; you can redistribute it and/or 5 + * modify it under the terms of the GNU General Public License 6 + * as published by the Free Software Foundation; either version 7 + * 2 of the License, or (at your option) any later version. 8 + * 9 + * Authors: Thomas Graf <tgraf@suug.ch> 10 + * 11 + * Note: Quantum tunneling is not supported. 12 + */ 13 + 14 + #include <linux/config.h> 15 + #include <linux/module.h> 16 + #include <linux/types.h> 17 + #include <linux/kernel.h> 18 + #include <linux/netdevice.h> 19 + #include <linux/skbuff.h> 20 + #include <net/pkt_sched.h> 21 + 22 + static int blackhole_enqueue(struct sk_buff *skb, struct Qdisc *sch) 23 + { 24 + qdisc_drop(skb, sch); 25 + return NET_XMIT_SUCCESS; 26 + } 27 + 28 + static struct sk_buff *blackhole_dequeue(struct Qdisc *sch) 29 + { 30 + return NULL; 31 + } 32 + 33 + static struct Qdisc_ops blackhole_qdisc_ops = { 34 + .id = "blackhole", 35 + .priv_size = 0, 36 + .enqueue = blackhole_enqueue, 37 + .dequeue = blackhole_dequeue, 38 + .owner = THIS_MODULE, 39 + }; 40 + 41 + static int __init blackhole_module_init(void) 42 + { 43 + return register_qdisc(&blackhole_qdisc_ops); 44 + } 45 + 46 + static void __exit blackhole_module_exit(void) 47 + { 48 + unregister_qdisc(&blackhole_qdisc_ops); 49 + } 50 + 51 + module_init(blackhole_module_init) 52 + module_exit(blackhole_module_exit) 53 + 54 + MODULE_LICENSE("GPL");

+24 -11

net/sched/sch_generic.c

··· 395 395 .owner = THIS_MODULE, 396 396 }; 397 397 398 - struct Qdisc * qdisc_create_dflt(struct net_device *dev, struct Qdisc_ops *ops) 398 + struct Qdisc *qdisc_alloc(struct net_device *dev, struct Qdisc_ops *ops) 399 399 { 400 400 void *p; 401 401 struct Qdisc *sch; 402 - int size; 402 + unsigned int size; 403 + int err = -ENOBUFS; 403 404 404 405 /* ensure that the Qdisc and the private data are 32-byte aligned */ 405 - size = ((sizeof(*sch) + QDISC_ALIGN_CONST) & ~QDISC_ALIGN_CONST); 406 - size += ops->priv_size + QDISC_ALIGN_CONST; 406 + size = QDISC_ALIGN(sizeof(*sch)); 407 + size += ops->priv_size + (QDISC_ALIGNTO - 1); 407 408 408 409 p = kmalloc(size, GFP_KERNEL); 409 410 if (!p) 410 - return NULL; 411 + goto errout; 411 412 memset(p, 0, size); 412 - 413 - sch = (struct Qdisc *)(((unsigned long)p + QDISC_ALIGN_CONST) 414 - & ~QDISC_ALIGN_CONST); 415 - sch->padded = (char *)sch - (char *)p; 413 + sch = (struct Qdisc *) QDISC_ALIGN((unsigned long) p); 414 + sch->padded = (char *) sch - (char *) p; 416 415 417 416 INIT_LIST_HEAD(&sch->list); 418 417 skb_queue_head_init(&sch->q); ··· 422 423 dev_hold(dev); 423 424 sch->stats_lock = &dev->queue_lock; 424 425 atomic_set(&sch->refcnt, 1); 426 + 427 + return sch; 428 + errout: 429 + return ERR_PTR(-err); 430 + } 431 + 432 + struct Qdisc * qdisc_create_dflt(struct net_device *dev, struct Qdisc_ops *ops) 433 + { 434 + struct Qdisc *sch; 435 + 436 + sch = qdisc_alloc(dev, ops); 437 + if (IS_ERR(sch)) 438 + goto errout; 439 + 425 440 if (!ops->init || ops->init(sch, NULL) == 0) 426 441 return sch; 427 442 428 - dev_put(dev); 429 - kfree(p); 443 + errout: 430 444 return NULL; 431 445 } 432 446 ··· 603 591 EXPORT_SYMBOL(noop_qdisc); 604 592 EXPORT_SYMBOL(noop_qdisc_ops); 605 593 EXPORT_SYMBOL(qdisc_create_dflt); 594 + EXPORT_SYMBOL(qdisc_alloc); 606 595 EXPORT_SYMBOL(qdisc_destroy); 607 596 EXPORT_SYMBOL(qdisc_reset); 608 597 EXPORT_SYMBOL(qdisc_restart);
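The new qdisc_alloc() reports failure through a pointer-encoded errno, which callers test with IS_ERR() and decode with PTR_ERR(). The user-space sketch below imitates that convention to show why the sign of the value handed to ERR_PTR() matters. The `sketch_` helpers and the 4095 threshold are assumptions of this example rather than the kernel's definitions. As far as the hunk above shows, `err` is initialised to `-ENOBUFS` and the error path returns `ERR_PTR(-err)`, which would encode a positive value; a check of the shape below would not classify that as an error.

```c
#include <errno.h>
#include <stdint.h>

/* Assumed upper bound on errno values encoded in a pointer. */
#define SKETCH_MAX_ERRNO 4095

/* Encode a negative errno value as a pointer. */
static void *sketch_err_ptr(intptr_t error)
{
    return (void *)error;
}

/* Decode the errno value carried by an error pointer. */
static intptr_t sketch_ptr_err(const void *ptr)
{
    return (intptr_t)ptr;
}

/* An error pointer lies in the top SKETCH_MAX_ERRNO values of the
 * address space, which is where small negative numbers land when they
 * are reinterpreted as unsigned addresses. */
static int sketch_is_err(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-SKETCH_MAX_ERRNO;
}
```

Under these assumptions, `sketch_err_ptr(-ENOBUFS)` is recognised as an error and decodes back to `-ENOBUFS`, while `sketch_err_ptr(ENOBUFS)` looks like a small ordinary address and passes the check as if it were a valid object.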
