tcp: enforce receive buffer memory limits by allowing the tcp window to shrink

Under certain circumstances, the tcp receive buffer memory limit
set by autotuning (sk_rcvbuf) is increased by incoming data
packets because the window does not close when it should.
This can result in the receive buffer growing all the way up to
tcp_rmem[2], even for tcp sessions with a low BDP.

To reproduce: Connect a TCP session with the receiver doing
nothing and the sender sending small packets (an infinite loop
of socket send() with 4 bytes of payload with a sleep of 1 ms
in between each send()). This will cause the tcp receive buffer
to grow all the way up to tcp_rmem[2].
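
A minimal sketch of the sender side of this reproduction, assuming
an already-connected socket whose peer never reads. The function
name and the count parameter are hypothetical (the description above
uses an infinite loop); only the 4-byte payload and the 1 ms sleep
come from the text.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <time.h>

/* Send `count` 4-byte payloads on the connected socket `fd`, sleeping
 * `delay_ms` milliseconds after each send(). Returns the number of
 * payloads sent, or -1 on the first short or failed send().
 * Hypothetical helper, not part of the patch.
 */
static long send_small_packets(int fd, long count, long delay_ms)
{
	static const char payload[4] = { 'p', 'i', 'n', 'g' };
	struct timespec delay = {
		.tv_sec = delay_ms / 1000,
		.tv_nsec = (delay_ms % 1000) * 1000000L,
	};
	long sent;

	assert(fd >= 0 && count >= 0 && delay_ms >= 0);

	for (sent = 0; sent < count; sent++) {
		if (send(fd, payload, sizeof(payload), 0) !=
		    (ssize_t)sizeof(payload))
			return -1;
		nanosleep(&delay, NULL);
	}
	return sent;
}
```

Calling this with a large count and delay_ms = 1 on a TCP socket,
while the receiver stays idle, approximates the traffic pattern
described above.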

As a result, a host can have individual tcp sessions with receive
buffers of size tcp_rmem[2], and the host itself can reach tcp_mem
limits, causing the host to go into tcp memory pressure mode.

The fundamental issue is the relationship between the granularity
of the window scaling factor and the number of bytes ACKed back
to the sender. This problem has previously been identified in
RFC 7323, appendix F [1].
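
The arithmetic can be illustrated with a small model. This is a
sketch, not kernel code: it assumes the memory-based calculation
asked for a smaller window, so the never-shrink rule re-offers the
remaining window rounded up to the scaling unit. The function names
are hypothetical; align_up() mirrors the effect of the kernel's
ALIGN() for power-of-two units.

```c
#include <assert.h>
#include <stdint.h>

/* Round x up to the next multiple of a; a must be a power of two. */
static uint32_t align_up(uint32_t x, uint32_t a)
{
	return (x + a - 1) & ~(a - 1);
}

/* Simplified model of the never-shrink rule: the receiver has
 * advertised `wnd` bytes, then `nbytes` arrive and are not read by
 * the application. Returns how far the right edge of the window
 * moves forward when the remaining window is rounded up to the
 * scaling unit (1 << wscale).
 */
static uint32_t right_edge_advance(uint32_t wnd, uint32_t nbytes,
				   unsigned int wscale)
{
	uint32_t cur_win, new_win;

	assert(wscale <= 14 && nbytes <= wnd);

	cur_win = wnd - nbytes;			/* rest of the old offer */
	new_win = align_up(cur_win, 1u << wscale);
	return new_win - cur_win;
}
```

With a 65536-byte window and a scaling factor of 7 (units of 128
bytes), a 4-byte segment gives right_edge_advance(65536, 4, 7) == 4:
the right edge advances by exactly the bytes received, so the
window never closes even though nothing was read.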

The Linux kernel currently never shrinks the window.

In addition to the overallocation of memory mentioned above, the
current behavior is functionally incorrect, because once tcp_rmem[2]
is reached and no remediations remain (i.e. tcp collapse fails to
free up any more memory and there are no packets to prune from the
out-of-order queue), the receiver will drop in-window packets,
resulting in retransmissions and an eventual timeout of the tcp
session. A receive buffer full condition should instead result
in a zero window and an indefinite wait.

In practice, this problem is largely hidden for most flows. It
is not applicable to mice flows. Elephant flows can send data
fast enough to "overrun" the sk_rcvbuf limit (in a single ACK),
triggering a zero window.

But this problem does show up for other types of flows. Examples
are websockets and other types of flows that send small amounts of
data spaced apart slightly in time. In these cases, we directly
encounter the problem described in [1].

RFC 7323, section 2.4 [2], says there are instances when a retracted
window can be offered, and that TCP implementations MUST ensure
that they handle a shrinking window, as specified in RFC 1122,
section 4.2.2.16 [3]. All prior RFCs on the topic of tcp window
management have made clear that the sender must accept a shrunk
window from the receiver, including RFC 793 [4] and RFC 1323 [5].

This patch implements the functionality to shrink the tcp window
when necessary to keep the right edge within the memory limit set
by autotuning (sk_rcvbuf). This new functionality is enabled with
the new sysctl: net.ipv4.tcp_shrink_window

Additional information can be found at:
https://blog.cloudflare.com/unbounded-memory-usage-by-tcp-for-receive-buffers-and-how-we-fixed-it/

[1] https://www.rfc-editor.org/rfc/rfc7323#appendix-F
[2] https://www.rfc-editor.org/rfc/rfc7323#section-2.4
[3] https://www.rfc-editor.org/rfc/rfc1122#page-91
[4] https://www.rfc-editor.org/rfc/rfc793
[5] https://www.rfc-editor.org/rfc/rfc1323

Signed-off-by: Mike Freemon <mfreemon@cloudflare.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

authored by mfreemon@cloudflare.com and committed by David S. Miller 3 years ago b650d953 a52305a8

+78 -9

5 changed files

Documentation/networking/ip-sysctl.rst
include/net/netns/ipv4.h
net/ipv4/sysctl_net_ipv4.c
net/ipv4/tcp_ipv4.c
net/ipv4/tcp_output.c

+15

Documentation/networking/ip-sysctl.rst

···
 tcp_window_scaling - BOOLEAN
 	Enable window scaling as defined in RFC1323.
 
+tcp_shrink_window - BOOLEAN
+	This changes how the TCP receive window is calculated.
+
+	RFC 7323, section 2.4, says there are instances when a retracted
+	window can be offered, and that TCP implementations MUST ensure
+	that they handle a shrinking window, as specified in RFC 1122.
+
+	- 0 - Disabled. The window is never shrunk.
+	- 1 - Enabled. The window is shrunk when necessary to remain within
+			the memory limit set by autotuning (sk_rcvbuf).
+			This only occurs if a non-zero receive window
+			scaling factor is also in effect.
+
+	Default: 0
+
 tcp_wmem - vector of 3 INTEGERs: min, default, max
 	min: Amount of memory reserved for send buffers for TCP sockets.
 	Each TCP socket has rights to use it due to fact of its birth.
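
A minimal sketch of toggling the setting from C, assuming the sysctl is exposed at the usual procfs location for net.ipv4 entries, /proc/sys/net/ipv4/tcp_shrink_window, and that the caller has permission to write it. The helper name is hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Write "1" (enabled) or "0" (disabled) to a boolean sysctl file.
 * Returns 0 on success and -1 if the file cannot be opened or written.
 * Hypothetical helper, not part of the patch.
 */
static int write_sysctl_flag(const char *path, int enabled)
{
	FILE *f;
	int ok;

	assert(path != NULL);

	f = fopen(path, "w");
	if (!f)
		return -1;

	ok = fputs(enabled ? "1\n" : "0\n", f) >= 0;
	if (fclose(f) != 0)
		ok = 0;

	return ok ? 0 : -1;
}
```

On a kernel that carries this patch, write_sysctl_flag("/proc/sys/net/ipv4/tcp_shrink_window", 1) would enable window shrinking. The table entry bounds the value with SYSCTL_ZERO and SYSCTL_ONE, so only 0 and 1 are expected to be accepted.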

include/net/netns/ipv4.h

···
 #endif
 	bool			fib_has_custom_local_routes;
 	bool			fib_offload_disabled;
+	u8			sysctl_tcp_shrink_window;
 #ifdef CONFIG_IP_ROUTE_CLASSID
 	atomic_t		fib_num_tclassid_users;
 #endif

net/ipv4/sysctl_net_ipv4.c

···
 		.extra1		= SYSCTL_ZERO,
 		.extra2		= &tcp_syn_linear_timeouts_max,
 	},
+	{
+		.procname	= "tcp_shrink_window",
+		.data		= &init_net.ipv4.sysctl_tcp_shrink_window,
+		.maxlen		= sizeof(u8),
+		.mode		= 0644,
+		.proc_handler	= proc_dou8vec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE,
+	},
 	{ }
 };
 

net/ipv4/tcp_ipv4.c

···
 		net->ipv4.tcp_congestion_control = &tcp_reno;
 
 	net->ipv4.sysctl_tcp_syn_linear_timeouts = 4;
+	net->ipv4.sysctl_tcp_shrink_window = 0;
+
 	return 0;
 }
 

+51 -9

net/ipv4/tcp_output.c

···
 	u32 old_win = tp->rcv_wnd;
 	u32 cur_win = tcp_receive_window(tp);
 	u32 new_win = __tcp_select_window(sk);
+	struct net *net = sock_net(sk);
 
-	/* Never shrink the offered window */
 	if (new_win < cur_win) {
 		/* Danger Will Robinson!
 		 * Don't update rcv_wup/rcv_wnd here or else
···
 		 *
 		 * Relax Will Robinson.
 		 */
-		if (new_win == 0)
-			NET_INC_STATS(sock_net(sk),
-				      LINUX_MIB_TCPWANTZEROWINDOWADV);
-		new_win = ALIGN(cur_win, 1 << tp->rx_opt.rcv_wscale);
+		if (!READ_ONCE(net->ipv4.sysctl_tcp_shrink_window) || !tp->rx_opt.rcv_wscale) {
+			/* Never shrink the offered window */
+			if (new_win == 0)
+				NET_INC_STATS(net, LINUX_MIB_TCPWANTZEROWINDOWADV);
+			new_win = ALIGN(cur_win, 1 << tp->rx_opt.rcv_wscale);
+		}
 	}
+
 	tp->rcv_wnd = new_win;
 	tp->rcv_wup = tp->rcv_nxt;
 
···
 	 * scaled window.
 	 */
 	if (!tp->rx_opt.rcv_wscale &&
-	    READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_workaround_signed_windows))
+	    READ_ONCE(net->ipv4.sysctl_tcp_workaround_signed_windows))
 		new_win = min(new_win, MAX_TCP_WINDOW);
 	else
 		new_win = min(new_win, (65535U << tp->rx_opt.rcv_wscale));
···
 	if (new_win == 0) {
 		tp->pred_flags = 0;
 		if (old_win)
-			NET_INC_STATS(sock_net(sk),
-				      LINUX_MIB_TCPTOZEROWINDOWADV);
+			NET_INC_STATS(net, LINUX_MIB_TCPTOZEROWINDOWADV);
 	} else if (old_win == 0) {
-		NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPFROMZEROWINDOWADV);
+		NET_INC_STATS(net, LINUX_MIB_TCPFROMZEROWINDOWADV);
 	}
 
 	return new_win;
···
 {
 	struct inet_connection_sock *icsk = inet_csk(sk);
 	struct tcp_sock *tp = tcp_sk(sk);
+	struct net *net = sock_net(sk);
 	/* MSS for the peer's data. Previous versions used mss_clamp
 	 * here. I don't know if the value based on our guesses
 	 * of peer's MSS is better for the performance. It's more correct
···
 		if (mss <= 0)
 			return 0;
 	}
+
+	/* Only allow window shrink if the sysctl is enabled and we have
+	 * a non-zero scaling factor in effect.
+	 */
+	if (READ_ONCE(net->ipv4.sysctl_tcp_shrink_window) && tp->rx_opt.rcv_wscale)
+		goto shrink_window_allowed;
+
+	/* do not allow window to shrink */
+
 	if (free_space < (full_space >> 1)) {
 		icsk->icsk_ack.quick = 0;
 
···
 	}
 
 	return window;
+
+shrink_window_allowed:
+	/* new window should always be an exact multiple of scaling factor */
+	free_space = round_down(free_space, 1 << tp->rx_opt.rcv_wscale);
+
+	if (free_space < (full_space >> 1)) {
+		icsk->icsk_ack.quick = 0;
+
+		if (tcp_under_memory_pressure(sk))
+			tcp_adjust_rcv_ssthresh(sk);
+
+		/* if free space is too low, return a zero window */
+		if (free_space < (allowed_space >> 4) || free_space < mss ||
+			free_space < (1 << tp->rx_opt.rcv_wscale))
+			return 0;
+	}
+
+	if (free_space > tp->rcv_ssthresh) {
+		free_space = tp->rcv_ssthresh;
+		/* new window should always be an exact multiple of scaling factor
+		 *
+		 * For this case, we ALIGN "up" (increase free_space) because
+		 * we know free_space is not zero here, it has been reduced from
+		 * the memory-based limit, and rcv_ssthresh is not a hard limit
+		 * (unlike sk_rcvbuf).
+		 */
+		free_space = ALIGN(free_space, (1 << tp->rx_opt.rcv_wscale));
+	}
+
+	return free_space;
 }
 
 void tcp_skb_collapse_tstamp(struct sk_buff *skb,
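
The shrink_window_allowed path in net/ipv4/tcp_output.c can be restated in userspace to try out concrete numbers. This is a sketch under stated assumptions, not the kernel implementation: plain integers stand in for the per-socket state, the function name is hypothetical, free_space is assumed non-negative, and the memory-pressure adjustment of rcv_ssthresh and the icsk_ack.quick reset are left out.

```c
#include <assert.h>

/* Userspace restatement of the shrink_window_allowed path.
 * Returns the window to advertise, always a multiple of the
 * scaling unit (1 << wscale), or 0 when free space is too low.
 */
static int shrinkable_window(int free_space, int full_space,
			     int allowed_space, int mss,
			     int rcv_ssthresh, unsigned int wscale)
{
	const int unit = 1 << wscale;

	/* The patch only takes this path with a non-zero scaling factor. */
	assert(wscale >= 1 && wscale <= 14);
	assert(free_space >= 0 && rcv_ssthresh >= 0);

	/* Round down so the offer never exceeds the memory-based limit. */
	free_space &= ~(unit - 1);

	if (free_space < (full_space >> 1)) {
		/* If free space is too low, advertise a zero window. */
		if (free_space < (allowed_space >> 4) ||
		    free_space < mss || free_space < unit)
			return 0;
	}

	/* rcv_ssthresh is a soft limit, so rounding up is acceptable here. */
	if (free_space > rcv_ssthresh)
		free_space = (rcv_ssthresh + unit - 1) & ~(unit - 1);

	return free_space;
}
```

For example, with a scaling factor of 7, full_space and allowed_space of 131072, an mss of 1460 and rcv_ssthresh of 65535, a free_space of 5000 yields a zero window (4992 is below allowed_space >> 4 = 8192), while a free_space of 20000 yields 19968.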
