Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge branch 'geneve-introduce-double-tunnel-gso-gro-support'

Paolo Abeni says:

====================
geneve: introduce double tunnel GSO/GRO support

This is the [belated] incarnation of the topic discussed at the last Netconf
[1].

In container orchestration in virtual environments there is consistent
usage of double UDP tunneling - specifically geneve. Such setups lack
support for GRO and GSO for inter-VM traffic.

After commit b430f6c38da6 ("Merge branch 'virtio_udp_tunnel_08_07_2025'
of https://github.com/pabeni/linux-devel") and the qemu counterpart, VMs
are able to send/receive GSO over UDP aggregated packets.

This series introduces the missing bit for full end-to-end aggregation
in the above mentioned scenario. Specifically:

- introduces a new netdev feature set to generalize the existing per device
  driver GSO admission check.
- adds GSO partial support for the geneve and vxlan drivers
- introduces and uses a geneve option to assist double tunnel GRO
- adds some simple functional tests for the above.
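
To make the geneve option mentioned in the list above more concrete: a Geneve header carries its option length in 4-byte words, and each option is a small TLV with a 4-byte header. The user-space sketch below walks the options looking for the hint. The class/type/length constants (0x100, 1, 1) are the ones this series defines; the function name `gro_hint_offset` and the raw-byte parsing are illustrative assumptions, not the kernel implementation.

```c
#include <stddef.h>
#include <stdint.h>

#define GENEVE_BASE_HLEN   8      /* fixed Geneve header size, in bytes */
#define OPT_NETDEV_CLASS   0x100  /* option class used by this series */
#define OPT_GRO_HINT_TYPE  1
#define OPT_GRO_HINT_LEN   1      /* option data length, in 4-byte words */

/*
 * Hypothetical helper: return the byte offset of the GRO hint data,
 * relative to the start of the Geneve header, or 0 if no well-formed
 * hint option is found.
 */
static unsigned int gro_hint_offset(const uint8_t *gh, size_t buf_len)
{
	unsigned int words_left;
	const uint8_t *opt;

	if (buf_len < GENEVE_BASE_HLEN)
		return 0;

	/* Byte 0 of the Geneve header: Ver (2 bits) | Opt Len (6 bits). */
	words_left = gh[0] & 0x3f;
	if (buf_len < GENEVE_BASE_HLEN + words_left * 4u)
		return 0;

	opt = gh + GENEVE_BASE_HLEN;
	while (words_left >= 1 + OPT_GRO_HINT_LEN) {
		unsigned int opt_class = ((unsigned int)opt[0] << 8) | opt[1];
		unsigned int type = opt[2];
		unsigned int len = opt[3] & 0x1f; /* low 5 bits, in words */

		if (opt_class == OPT_NETDEV_CLASS &&
		    type == OPT_GRO_HINT_TYPE && len == OPT_GRO_HINT_LEN)
			return (unsigned int)(opt + 4 - gh);

		/* A bad length, or nothing left after this option. */
		if (len + 1 >= words_left)
			return 0;

		/* Skip the 4-byte option header plus its data. */
		words_left -= len + 1;
		opt += (len + 1) * 4;
	}
	return 0;
}
```

Returning 0 for "not found" works here because a valid hint can never sit at offset 0: it always follows the 8-byte base header and its own 4-byte option header.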

The new device features set is not strictly needed for the following
work, but avoids the introduction of trivial `ndo_features_check` to
support GSO partial and thus possible performance regression due to the
additional indirect call. Such feature set could be leveraged by a
number of existing drivers (intel, meta and possibly wangxun) to avoid
duplicate code/tests. Such part has been omitted here to keep the series
small.

Both GSO partial support and double GRO support have some downsides.
With the first in place, GSO partial packets will traverse the network
stack 'downstream' of the outer geneve UDP tunnel and will be visible to
the UDP/IP/IPv6 layers and to netfilter. Currently only H/W NICs implement
GSO partial support and such packets are visible only via software taps.

Double UDP tunnel GRO will cook 'GSO partial'-like aggregate packets,
i.e. the inner UDP encapsulation header set will still carry the
wire-level lengths and csum, so that segmentation treating such
headers as part of a giant, constant encapsulation header will yield the
correct result.

The correct GSO packet layout is applied when the packet traverses the
outermost geneve encapsulation.
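
Applying the aggregate layout means rewriting the nested IPv4 total length, and the header checksum has to follow. The sketch below shows one common way to do that, the incremental update from RFC 1624 (eq. 3), in plain user-space C. The helper names (`fold16`, `ipv4_csum`, `csum_update16`, `ipv4_set_tot_len`) are hypothetical and operate on raw header bytes; this is an illustration of the technique, not the code path used by the series.

```c
#include <stddef.h>
#include <stdint.h>

/* Fold a 32-bit accumulator into a 16-bit one's-complement sum. */
static uint16_t fold16(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/*
 * Full checksum over an IPv4 header of len bytes (len must be even).
 * Yields 0 when run over a header that already carries a valid checksum.
 */
static uint16_t ipv4_csum(const uint8_t *hdr, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += ((uint32_t)hdr[i] << 8) | hdr[i + 1];
	return (uint16_t)~fold16(sum);
}

/* RFC 1624, eq. 3: HC' = ~(~HC + ~m + m'), all in one's-complement. */
static uint16_t csum_update16(uint16_t check, uint16_t old_val,
			      uint16_t new_val)
{
	uint32_t sum = (uint32_t)(uint16_t)~check +
		       (uint32_t)(uint16_t)~old_val + new_val;

	return (uint16_t)~fold16(sum);
}

/* Rewrite tot_len (bytes 2-3) and patch the checksum (bytes 10-11). */
static void ipv4_set_tot_len(uint8_t *hdr, uint16_t tot_len)
{
	uint16_t old_len = (uint16_t)((hdr[2] << 8) | hdr[3]);
	uint16_t check = (uint16_t)((hdr[10] << 8) | hdr[11]);

	check = csum_update16(check, old_len, tot_len);
	hdr[2] = (uint8_t)(tot_len >> 8);
	hdr[3] = (uint8_t)(tot_len & 0xff);
	hdr[10] = (uint8_t)(check >> 8);
	hdr[11] = (uint8_t)(check & 0xff);
}
```

The incremental form touches only the changed 16-bit word, which is why it is attractive when a single length field is rewritten on an otherwise untouched header.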

Both GSO partial and double UDP encap are disabled by default and must
be explicitly enabled via ethtool and geneve device configuration,
respectively.

Finally note that the GSO partial feature could potentially be applied
to all the other UDP tunnels, but this series limits its usage to geneve
and vxlan devices.

Link: https://netdev.bots.linux.dev/netconf/2024/paolo.pdf [1]
====================

Link: https://patch.msgid.link/cover.1769011015.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

+975 -39
+3
Documentation/netlink/specs/rt-link.yaml
···
         name: port-range
         type: binary
         struct: ifla-geneve-port-range
+      -
+        name: gro-hint
+        type: flag
   -
     name: linkinfo-hsr-attrs
     name-prefix: ifla-hsr-
+523 -34
drivers/net/geneve.c
···
 #define GENEVE_IPV4_HLEN (ETH_HLEN + sizeof(struct iphdr) + GENEVE_BASE_HLEN)
 #define GENEVE_IPV6_HLEN (ETH_HLEN + sizeof(struct ipv6hdr) + GENEVE_BASE_HLEN)
 
+#define GENEVE_OPT_NETDEV_CLASS 0x100
+#define GENEVE_OPT_GRO_HINT_SIZE 8
+#define GENEVE_OPT_GRO_HINT_TYPE 1
+#define GENEVE_OPT_GRO_HINT_LEN 1
+
+struct geneve_opt_gro_hint {
+	u8 inner_proto_id:2,
+	   nested_is_v6:1;
+	u8 nested_nh_offset;
+	u8 nested_tp_offset;
+	u8 nested_hdr_len;
+};
+
+struct geneve_skb_cb {
+	unsigned int gro_hint_len;
+	struct geneve_opt_gro_hint gro_hint;
+};
+
+#define GENEVE_SKB_CB(__skb) ((struct geneve_skb_cb *)&((__skb)->cb[0]))
+
 /* per-network namespace private data for this module */
 struct geneve_net {
 	struct list_head geneve_list;
···
 	bool collect_md;
 	bool use_udp6_rx_checksums;
 	bool ttl_inherit;
+	bool gro_hint;
 	enum ifla_geneve_df df;
 	bool inner_proto_inherit;
 	u16 port_min;
···
 
 struct geneve_sock {
 	bool collect_md;
+	bool gro_hint;
 	struct list_head list;
 	struct socket *sock;
 	struct rcu_head rcu;
 	int refcnt;
 	struct hlist_head vni_list[VNI_HASH_SIZE];
 };
+
+static const __be16 proto_id_map[] = { htons(ETH_P_TEB),
+				       htons(ETH_P_IPV6),
+				       htons(ETH_P_IP) };
+
+static int proto_to_id(__be16 proto)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(proto_id_map); i++)
+		if (proto_id_map[i] == proto)
+			return i;
+
+	return -1;
+}
 
 static inline __u32 geneve_net_vni_hash(u8 vni[3])
 {
···
 
 /* geneve receive/decap routine */
 static void geneve_rx(struct geneve_dev *geneve, struct geneve_sock *gs,
-		      struct sk_buff *skb)
+		      struct sk_buff *skb, const struct genevehdr *gnvh)
 {
-	struct genevehdr *gnvh = geneve_hdr(skb);
 	struct metadata_dst *tun_dst = NULL;
 	unsigned int len;
 	int nh, err = 0;
···
 		}
 	}
 
+	/* Skip the additional GRO stage when hints are in use. */
 	len = skb->len;
-	err = gro_cells_receive(&geneve->gro_cells, skb);
+	if (skb->encapsulation)
+		err = netif_rx(skb);
+	else
+		err = gro_cells_receive(&geneve->gro_cells, skb);
 	if (likely(err == NET_RX_SUCCESS))
 		dev_dstats_rx_add(geneve->dev, len);
 
···
 
 	dst_cache_destroy(&geneve->cfg.info.dst_cache);
 	gro_cells_destroy(&geneve->gro_cells);
+}
+
+static int geneve_hlen(const struct genevehdr *gh)
+{
+	return sizeof(*gh) + gh->opt_len * 4;
+}
+
+/*
+ * Look for GRO hint in the genenve options; if not found or does not pass basic
+ * sanitization return 0, otherwise the offset WRT the geneve hdr start.
+ */
+static unsigned int
+geneve_opt_gro_hint_off(const struct genevehdr *gh, __be16 *type,
+			unsigned int *gh_len)
+{
+	struct geneve_opt *opt = (void *)(gh + 1);
+	unsigned int id, opt_len = gh->opt_len;
+	struct geneve_opt_gro_hint *gro_hint;
+
+	while (opt_len >= (GENEVE_OPT_GRO_HINT_SIZE >> 2)) {
+		if (opt->opt_class == htons(GENEVE_OPT_NETDEV_CLASS) &&
+		    opt->type == GENEVE_OPT_GRO_HINT_TYPE &&
+		    opt->length == GENEVE_OPT_GRO_HINT_LEN)
+			goto found;
+
+		/* check for bad opt len */
+		if (opt->length + 1 >= opt_len)
+			return 0;
+
+		/* next opt */
+		opt_len -= opt->length + 1;
+		opt = ((void *)opt) + ((opt->length + 1) << 2);
+	}
+	return 0;
+
+found:
+	gro_hint = (struct geneve_opt_gro_hint *)opt->opt_data;
+
+	/*
+	 * Sanitize the hinted hdrs: the nested transport is UDP and must fit
+	 * the overall hinted hdr size.
+	 */
+	if (gro_hint->nested_tp_offset + sizeof(struct udphdr) >
+	    gro_hint->nested_hdr_len)
+		return 0;
+
+	if (gro_hint->nested_nh_offset +
+	    (gro_hint->nested_is_v6 ? sizeof(struct ipv6hdr) :
+				      sizeof(struct iphdr)) >
+	    gro_hint->nested_tp_offset)
+		return 0;
+
+	/* Allow only supported L2. */
+	id = gro_hint->inner_proto_id;
+	if (id >= ARRAY_SIZE(proto_id_map))
+		return 0;
+
+	*type = proto_id_map[id];
+	*gh_len += gro_hint->nested_hdr_len;
+
+	return (void *)gro_hint - (void *)gh;
+}
+
+static const struct geneve_opt_gro_hint *
+geneve_opt_gro_hint(const struct genevehdr *gh, unsigned int hint_off)
+{
+	return (const struct geneve_opt_gro_hint *)((void *)gh + hint_off);
+}
+
+static unsigned int
+geneve_sk_gro_hint_off(const struct sock *sk, const struct genevehdr *gh,
+		       __be16 *type, unsigned int *gh_len)
+{
+	const struct geneve_sock *gs = rcu_dereference_sk_user_data(sk);
+
+	if (!gs || !gs->gro_hint)
+		return 0;
+	return geneve_opt_gro_hint_off(gh, type, gh_len);
+}
+
+/* Validate the packet headers pointed by data WRT the provided hint */
+static bool
+geneve_opt_gro_hint_validate(void *data,
+			     const struct geneve_opt_gro_hint *gro_hint)
+{
+	void *nested_nh = data + gro_hint->nested_nh_offset;
+	struct iphdr *iph;
+
+	if (gro_hint->nested_is_v6) {
+		struct ipv6hdr *ipv6h = nested_nh;
+		struct ipv6_opt_hdr *opth;
+		int offset, len;
+
+		if (ipv6h->nexthdr == IPPROTO_UDP)
+			return true;
+
+		offset = sizeof(*ipv6h) + gro_hint->nested_nh_offset;
+		while (offset + sizeof(*opth) <= gro_hint->nested_tp_offset) {
+			opth = data + offset;
+
+			len = ipv6_optlen(opth);
+			if (len + offset > gro_hint->nested_tp_offset)
+				return false;
+			if (opth->nexthdr == IPPROTO_UDP)
+				return true;
+
+			offset += len;
+		}
+		return false;
+	}
+
+	iph = nested_nh;
+	if (*(u8 *)iph != 0x45 || ip_is_fragment(iph) ||
+	    iph->protocol != IPPROTO_UDP || ip_fast_csum((u8 *)iph, 5))
+		return false;
+
+	return true;
+}
+
+/*
+ * Validate the skb headers following the specified geneve hdr vs the
+ * provided hint, including nested L4 checksum.
+ * The caller already ensured that the relevant amount of data is available
+ * in the linear part.
+ */
+static bool
+geneve_opt_gro_hint_validate_csum(const struct sk_buff *skb,
+				  const struct genevehdr *gh,
+				  const struct geneve_opt_gro_hint *gro_hint)
+{
+	unsigned int plen, gh_len = geneve_hlen(gh);
+	void *nested = (void *)gh + gh_len;
+	struct udphdr *nested_uh;
+	unsigned int nested_len;
+	struct ipv6hdr *ipv6h;
+	struct iphdr *iph;
+	__wsum csum, psum;
+
+	if (!geneve_opt_gro_hint_validate(nested, gro_hint))
+		return false;
+
+	/* Use GRO hints with nested csum only if the outer header has csum. */
+	nested_uh = nested + gro_hint->nested_tp_offset;
+	if (!nested_uh->check || skb->ip_summed == CHECKSUM_PARTIAL)
+		return true;
+
+	if (!NAPI_GRO_CB(skb)->csum_valid)
+		return false;
+
+	/* Compute the complete checksum up to the nested transport. */
+	plen = gh_len + gro_hint->nested_tp_offset;
+	csum = csum_sub(NAPI_GRO_CB(skb)->csum, csum_partial(gh, plen, 0));
+	nested_len = skb_gro_len(skb) - plen;
+
+	/* Compute the nested pseudo header csum. */
+	ipv6h = nested + gro_hint->nested_nh_offset;
+	iph = (struct iphdr *)ipv6h;
+	psum = gro_hint->nested_is_v6 ?
+	       ~csum_unfold(csum_ipv6_magic(&ipv6h->saddr, &ipv6h->daddr,
+					    nested_len, IPPROTO_UDP, 0)) :
+	       csum_tcpudp_nofold(iph->saddr, iph->daddr,
+				  nested_len, IPPROTO_UDP, 0);
+
+	return !csum_fold(csum_add(psum, csum));
+}
+
+static int geneve_post_decap_hint(const struct sock *sk, struct sk_buff *skb,
+				  unsigned int gh_len,
+				  struct genevehdr **geneveh)
+{
+	const struct geneve_opt_gro_hint *gro_hint;
+	unsigned int len, total_len, hint_off;
+	struct ipv6hdr *ipv6h;
+	struct iphdr *iph;
+	struct udphdr *uh;
+	__be16 p;
+
+	hint_off = geneve_sk_gro_hint_off(sk, *geneveh, &p, &len);
+	if (!hint_off)
+		return 0;
+
+	if (!skb_is_gso(skb))
+		return 0;
+
+	gro_hint = geneve_opt_gro_hint(*geneveh, hint_off);
+	if (unlikely(!pskb_may_pull(skb, gro_hint->nested_hdr_len)))
+		return -ENOMEM;
+
+	*geneveh = geneve_hdr(skb);
+	gro_hint = geneve_opt_gro_hint(*geneveh, hint_off);
+
+	/*
+	 * Validate hints from untrusted source before accessing
+	 * the headers; csum will be checked later by the nested
+	 * protocol rx path.
+	 */
+	if (unlikely(skb_shinfo(skb)->gso_type & SKB_GSO_DODGY &&
+		     !geneve_opt_gro_hint_validate(skb->data, gro_hint)))
+		return -EINVAL;
+
+	ipv6h = (void *)skb->data + gro_hint->nested_nh_offset;
+	iph = (struct iphdr *)ipv6h;
+	total_len = skb->len - gro_hint->nested_nh_offset;
+	if (total_len > GRO_LEGACY_MAX_SIZE)
+		return -E2BIG;
+
+	/*
+	 * After stripping the outer encap, the packet still carries a
+	 * tunnel encapsulation: the nested one.
+	 */
+	skb->encapsulation = 1;
+
+	/* GSO expect a valid transpor header, move it to the current one. */
+	skb_set_transport_header(skb, gro_hint->nested_tp_offset);
+
+	/* Adjust the nested IP{6} hdr to actual GSO len. */
+	if (gro_hint->nested_is_v6) {
+		ipv6h->payload_len = htons(total_len - sizeof(*ipv6h));
+	} else {
+		__be16 old_len = iph->tot_len;
+
+		iph->tot_len = htons(total_len);
+
+		/* For IPv4 additionally adjust the nested csum. */
+		csum_replace2(&iph->check, old_len, iph->tot_len);
+		ip_send_check(iph);
+	}
+
+	/* Adjust the nested UDP header len and checksum. */
+	uh = udp_hdr(skb);
+	uh->len = htons(skb->len - gro_hint->nested_tp_offset);
+	if (uh->check) {
+		len = skb->len - gro_hint->nested_nh_offset;
+		skb_shinfo(skb)->gso_type |= SKB_GSO_UDP_TUNNEL_CSUM;
+		if (gro_hint->nested_is_v6)
+			uh->check = ~udp_v6_check(len, &ipv6h->saddr,
+						  &ipv6h->daddr, 0);
+		else
+			uh->check = ~udp_v4_check(len, iph->saddr,
+						  iph->daddr, 0);
+	} else {
+		skb_shinfo(skb)->gso_type |= SKB_GSO_UDP_TUNNEL;
+	}
+	return 0;
 }
 
 /* Callback from net/ipv4/udp.c to receive packets */
···
 		goto drop;
 	}
 
-	geneve_rx(geneve, gs, skb);
+	/*
+	 * After hint processing, the transport header points to the inner one
+	 * and we can't use anymore on geneve_hdr().
+	 */
+	geneveh = geneve_hdr(skb);
+	if (geneve_post_decap_hint(sk, skb, sizeof(struct genevehdr) +
+				   opts_len, &geneveh)) {
+		DEV_STATS_INC(geneve->dev, rx_errors);
+		goto drop;
+	}
+
+	geneve_rx(geneve, gs, skb, geneveh);
 	return 0;
 
 drop:
···
 	return sock;
 }
 
-static int geneve_hlen(struct genevehdr *gh)
+static bool geneve_hdr_match(struct sk_buff *skb,
+			     const struct genevehdr *gh,
+			     const struct genevehdr *gh2,
+			     unsigned int hint_off)
 {
-	return sizeof(*gh) + gh->opt_len * 4;
+	const struct geneve_opt_gro_hint *gro_hint;
+	void *nested, *nested2, *nh, *nh2;
+	struct udphdr *udp, *udp2;
+	unsigned int gh_len;
+
+	/* Match the geneve hdr and options */
+	if (gh->opt_len != gh2->opt_len)
+		return false;
+
+	gh_len = geneve_hlen(gh);
+	if (memcmp(gh, gh2, gh_len))
+		return false;
+
+	if (!hint_off)
+		return true;
+
+	/*
+	 * When gro is present consider the nested headers as part
+	 * of the geneve options
+	 */
+	nested = (void *)gh + gh_len;
+	nested2 = (void *)gh2 + gh_len;
+	gro_hint = geneve_opt_gro_hint(gh, hint_off);
+	if (!memcmp(nested, nested2, gro_hint->nested_hdr_len))
+		return true;
+
+	/*
+	 * The nested headers differ; the packets can still belong to
+	 * the same flow when IPs/proto/ports match; if so flushing is
+	 * required.
+	 */
+	nh = nested + gro_hint->nested_nh_offset;
+	nh2 = nested2 + gro_hint->nested_nh_offset;
+	if (gro_hint->nested_is_v6) {
+		struct ipv6hdr *iph = nh, *iph2 = nh2;
+		unsigned int nested_nlen;
+		__be32 first_word;
+
+		first_word = *(__be32 *)iph ^ *(__be32 *)iph2;
+		if ((first_word & htonl(0xF00FFFFF)) ||
+		    !ipv6_addr_equal(&iph->saddr, &iph2->saddr) ||
+		    !ipv6_addr_equal(&iph->daddr, &iph2->daddr) ||
+		    iph->nexthdr != iph2->nexthdr)
+			return false;
+
+		nested_nlen = gro_hint->nested_tp_offset -
+			      gro_hint->nested_nh_offset;
+		if (nested_nlen > sizeof(struct ipv6hdr) &&
+		    (memcmp(iph + 1, iph2 + 1,
+			    nested_nlen - sizeof(struct ipv6hdr))))
+			return false;
+	} else {
+		struct iphdr *iph = nh, *iph2 = nh2;
+
+		if ((iph->protocol ^ iph2->protocol) |
+		    ((__force u32)iph->saddr ^ (__force u32)iph2->saddr) |
+		    ((__force u32)iph->daddr ^ (__force u32)iph2->daddr))
+			return false;
+	}
+
+	udp = nested + gro_hint->nested_tp_offset;
+	udp2 = nested2 + gro_hint->nested_tp_offset;
+	if (udp->source != udp2->source || udp->dest != udp2->dest ||
+	    udp->check != udp2->check)
+		return false;
+
+	NAPI_GRO_CB(skb)->flush = 1;
+	return true;
 }
 
 static struct sk_buff *geneve_gro_receive(struct sock *sk,
 					  struct list_head *head,
 					  struct sk_buff *skb)
 {
+	unsigned int hlen, gh_len, off_gnv, hint_off;
+	const struct geneve_opt_gro_hint *gro_hint;
+	const struct packet_offload *ptype;
+	struct genevehdr *gh, *gh2;
 	struct sk_buff *pp = NULL;
 	struct sk_buff *p;
-	struct genevehdr *gh, *gh2;
-	unsigned int hlen, gh_len, off_gnv;
-	const struct packet_offload *ptype;
-	__be16 type;
 	int flush = 1;
+	__be16 type;
 
 	off_gnv = skb_gro_offset(skb);
 	hlen = off_gnv + sizeof(*gh);
···
 	if (gh->ver != GENEVE_VER || gh->oam)
 		goto out;
 	gh_len = geneve_hlen(gh);
+	type = gh->proto_type;
 
 	hlen = off_gnv + gh_len;
 	if (!skb_gro_may_pull(skb, hlen)) {
···
 			goto out;
 	}
 
+	/* The GRO hint/nested hdr could use a different ethernet type. */
+	hint_off = geneve_sk_gro_hint_off(sk, gh, &type, &gh_len);
+	if (hint_off) {
+
+		/*
+		 * If the hint is present, and nested hdr validation fails, do
+		 * not attempt plain GRO: it will ignore inner hdrs and cause
+		 * OoO.
+		 */
+		gh = skb_gro_header(skb, off_gnv + gh_len, off_gnv);
+		if (unlikely(!gh))
+			goto out;
+
+		gro_hint = geneve_opt_gro_hint(gh, hint_off);
+		if (!geneve_opt_gro_hint_validate_csum(skb, gh, gro_hint))
+			goto out;
+	}
+
 	list_for_each_entry(p, head, list) {
 		if (!NAPI_GRO_CB(p)->same_flow)
 			continue;
 
 		gh2 = (struct genevehdr *)(p->data + off_gnv);
-		if (gh->opt_len != gh2->opt_len ||
-		    memcmp(gh, gh2, gh_len)) {
+		if (!geneve_hdr_match(skb, gh, gh2, hint_off)) {
 			NAPI_GRO_CB(p)->same_flow = 0;
 			continue;
 		}
···
 
 	skb_gro_pull(skb, gh_len);
 	skb_gro_postpull_rcsum(skb, gh, gh_len);
-	type = gh->proto_type;
 	if (likely(type == htons(ETH_P_TEB)))
 		return call_gro_receive(eth_gro_receive, head, skb);
 
···
 	gh = (struct genevehdr *)(skb->data + nhoff);
 	gh_len = geneve_hlen(gh);
 	type = gh->proto_type;
+	geneve_opt_gro_hint_off(gh, &type, &gh_len);
 
 	/* since skb->encapsulation is set, eth_gro_complete() sets the inner mac header */
 	if (likely(type == htons(ETH_P_TEB)))
···
 
 static struct geneve_sock *geneve_find_sock(struct geneve_net *gn,
 					    sa_family_t family,
-					    __be16 dst_port)
+					    __be16 dst_port,
+					    bool gro_hint)
 {
 	struct geneve_sock *gs;
 
 	list_for_each_entry(gs, &gn->sock_list, list) {
 		if (inet_sk(gs->sock->sk)->inet_sport == dst_port &&
-		    geneve_get_sk_family(gs) == family) {
+		    geneve_get_sk_family(gs) == family &&
+		    gs->gro_hint == gro_hint) {
 			return gs;
 		}
 	}
···
 {
 	struct net *net = geneve->net;
 	struct geneve_net *gn = net_generic(net, geneve_net_id);
+	bool gro_hint = geneve->cfg.gro_hint;
 	struct geneve_dev_node *node;
 	struct geneve_sock *gs;
 	__u8 vni[3];
 	__u32 hash;
 
-	gs = geneve_find_sock(gn, ipv6 ? AF_INET6 : AF_INET, geneve->cfg.info.key.tp_dst);
+	gs = geneve_find_sock(gn, ipv6 ? AF_INET6 : AF_INET,
+			      geneve->cfg.info.key.tp_dst, gro_hint);
 	if (gs) {
 		gs->refcnt++;
 		goto out;
···
 
 out:
 	gs->collect_md = geneve->cfg.collect_md;
+	gs->gro_hint = gro_hint;
 #if IS_ENABLED(CONFIG_IPV6)
 	if (ipv6) {
 		rcu_assign_pointer(geneve->sock6, gs);
···
 		ip_tunnel_info_opts_get(geneveh->options, info);
 }
 
+static int geneve_build_gro_hint_opt(const struct geneve_dev *geneve,
+				     struct sk_buff *skb)
+{
+	struct geneve_skb_cb *cb = GENEVE_SKB_CB(skb);
+	struct geneve_opt_gro_hint *hint;
+	unsigned int nhlen;
+	bool nested_is_v6;
+	int id;
+
+	BUILD_BUG_ON(sizeof(skb->cb) < sizeof(struct geneve_skb_cb));
+	cb->gro_hint_len = 0;
+
+	/* Try to add the GRO hint only in case of double encap. */
+	if (!geneve->cfg.gro_hint || !skb->encapsulation)
+		return 0;
+
+	/*
+	 * The nested headers must fit the geneve opt len fields and the
+	 * nested encap must carry a nested transport (UDP) header.
+	 */
+	nhlen = skb_inner_mac_header(skb) - skb->data;
+	if (nhlen > 255 || !skb_transport_header_was_set(skb) ||
+	    skb->inner_protocol_type != ENCAP_TYPE_ETHER ||
+	    (skb_transport_offset(skb) + sizeof(struct udphdr) > nhlen))
+		return 0;
+
+	id = proto_to_id(skb->inner_protocol);
+	if (id < 0)
+		return 0;
+
+	nested_is_v6 = skb->protocol == htons(ETH_P_IPV6);
+	if (nested_is_v6) {
+		int start = skb_network_offset(skb) + sizeof(struct ipv6hdr);
+		u8 proto = ipv6_hdr(skb)->nexthdr;
+		__be16 foff;
+
+		if (ipv6_skip_exthdr(skb, start, &proto, &foff) < 0 ||
+		    proto != IPPROTO_UDP)
+			return 0;
+	} else {
+		if (ip_hdr(skb)->protocol != IPPROTO_UDP)
+			return 0;
+	}
+
+	hint = &cb->gro_hint;
+	memset(hint, 0, sizeof(*hint));
+	hint->inner_proto_id = id;
+	hint->nested_is_v6 = skb->protocol == htons(ETH_P_IPV6);
+	hint->nested_nh_offset = skb_network_offset(skb);
+	hint->nested_tp_offset = skb_transport_offset(skb);
+	hint->nested_hdr_len = nhlen;
+	cb->gro_hint_len = GENEVE_OPT_GRO_HINT_SIZE;
+	return GENEVE_OPT_GRO_HINT_SIZE;
+}
+
+static void geneve_put_gro_hint_opt(struct genevehdr *gnvh, int opt_size,
+				    const struct geneve_opt_gro_hint *hint)
+{
+	struct geneve_opt *gro_opt;
+
+	/* geneve_build_header() did not took in account the GRO hint. */
+	gnvh->opt_len = (opt_size + GENEVE_OPT_GRO_HINT_SIZE) >> 2;
+
+	gro_opt = (void *)(gnvh + 1) + opt_size;
+	memset(gro_opt, 0, sizeof(*gro_opt));
+
+	gro_opt->opt_class = htons(GENEVE_OPT_NETDEV_CLASS);
+	gro_opt->type = GENEVE_OPT_GRO_HINT_TYPE;
+	gro_opt->length = GENEVE_OPT_GRO_HINT_LEN;
+	memcpy(gro_opt + 1, hint, sizeof(*hint));
+}
+
 static int geneve_build_skb(struct dst_entry *dst, struct sk_buff *skb,
 			    const struct ip_tunnel_info *info,
-			    bool xnet, int ip_hdr_len,
-			    bool inner_proto_inherit)
+			    const struct geneve_dev *geneve, int ip_hdr_len)
 {
 	bool udp_sum = test_bit(IP_TUNNEL_CSUM_BIT, info->key.tun_flags);
+	bool inner_proto_inherit = geneve->cfg.inner_proto_inherit;
+	bool xnet = !net_eq(geneve->net, dev_net(geneve->dev));
+	struct geneve_skb_cb *cb = GENEVE_SKB_CB(skb);
 	struct genevehdr *gnvh;
 	__be16 inner_proto;
+	bool double_encap;
 	int min_headroom;
+	int opt_size;
 	int err;
 
 	skb_reset_mac_header(skb);
 	skb_scrub_packet(skb, xnet);
 
+	opt_size = info->options_len + cb->gro_hint_len;
 	min_headroom = LL_RESERVED_SPACE(dst->dev) + dst->header_len +
-		       GENEVE_BASE_HLEN + info->options_len + ip_hdr_len;
+		       GENEVE_BASE_HLEN + opt_size + ip_hdr_len;
 	err = skb_cow_head(skb, min_headroom);
 	if (unlikely(err))
 		goto free_dst;
 
+	double_encap = udp_tunnel_handle_partial(skb);
 	err = udp_tunnel_handle_offloads(skb, udp_sum);
 	if (err)
 		goto free_dst;
 
-	gnvh = __skb_push(skb, sizeof(*gnvh) + info->options_len);
+	gnvh = __skb_push(skb, sizeof(*gnvh) + opt_size);
 	inner_proto = inner_proto_inherit ? skb->protocol : htons(ETH_P_TEB);
 	geneve_build_header(gnvh, info, inner_proto);
-	skb_set_inner_protocol(skb, inner_proto);
+
+	if (cb->gro_hint_len)
+		geneve_put_gro_hint_opt(gnvh, info->options_len, &cb->gro_hint);
+
+	udp_tunnel_set_inner_protocol(skb, double_encap, inner_proto);
 	return 0;
 
 free_dst:
···
 			   struct geneve_dev *geneve,
 			   const struct ip_tunnel_info *info)
 {
-	bool inner_proto_inherit = geneve->cfg.inner_proto_inherit;
-	bool xnet = !net_eq(geneve->net, dev_net(geneve->dev));
 	struct geneve_sock *gs4 = rcu_dereference(geneve->sock4);
 	const struct ip_tunnel_key *key = &info->key;
 	struct rtable *rt;
···
 	__be16 sport;
 	int err;
 
-	if (skb_vlan_inet_prepare(skb, inner_proto_inherit))
+	if (skb_vlan_inet_prepare(skb, geneve->cfg.inner_proto_inherit))
 		return -EINVAL;
 
 	if (!gs4)
···
 		return PTR_ERR(rt);
 
 	err = skb_tunnel_check_pmtu(skb, &rt->dst,
-				    GENEVE_IPV4_HLEN + info->options_len,
+				    GENEVE_IPV4_HLEN + info->options_len +
+				    geneve_build_gro_hint_opt(geneve, skb),
 				    netif_is_any_bridge_port(dev));
 	if (err < 0) {
 		dst_release(&rt->dst);
···
 		}
 	}
 
-	err = geneve_build_skb(&rt->dst, skb, info, xnet, sizeof(struct iphdr),
-			       inner_proto_inherit);
+	err = geneve_build_skb(&rt->dst, skb, info, geneve,
+			       sizeof(struct iphdr));
 	if (unlikely(err))
 		return err;
 
···
 			    struct geneve_dev *geneve,
 			    const struct ip_tunnel_info *info)
 {
-	bool inner_proto_inherit = geneve->cfg.inner_proto_inherit;
-	bool xnet = !net_eq(geneve->net, dev_net(geneve->dev));
 	struct geneve_sock *gs6 = rcu_dereference(geneve->sock6);
 	const struct ip_tunnel_key *key = &info->key;
 	struct dst_entry *dst = NULL;
···
 	__be16 sport;
 	int err;
 
-	if (skb_vlan_inet_prepare(skb, inner_proto_inherit))
+	if (skb_vlan_inet_prepare(skb, geneve->cfg.inner_proto_inherit))
 		return -EINVAL;
 
 	if (!gs6)
···
 		return PTR_ERR(dst);
 
 	err = skb_tunnel_check_pmtu(skb, dst,
-				    GENEVE_IPV6_HLEN + info->options_len,
+				    GENEVE_IPV6_HLEN + info->options_len +
+				    geneve_build_gro_hint_opt(geneve, skb),
 				    netif_is_any_bridge_port(dev));
 	if (err < 0) {
 		dst_release(dst);
···
 			ttl = key->ttl;
 		ttl = ttl ? : ip6_dst_hoplimit(dst);
 	}
-	err = geneve_build_skb(dst, skb, info, xnet, sizeof(struct ipv6hdr),
-			       inner_proto_inherit);
+	err = geneve_build_skb(dst, skb, info, geneve, sizeof(struct ipv6hdr));
 	if (unlikely(err))
 		return err;
 
···
 	dev->features |= NETIF_F_RXCSUM;
 	dev->features |= NETIF_F_GSO_SOFTWARE;
 
+	/* Partial features are disabled by default. */
 	dev->hw_features |= NETIF_F_SG | NETIF_F_HW_CSUM | NETIF_F_FRAGLIST;
 	dev->hw_features |= NETIF_F_RXCSUM;
 	dev->hw_features |= NETIF_F_GSO_SOFTWARE;
+	dev->hw_features |= UDP_TUNNEL_PARTIAL_FEATURES;
+	dev->hw_features |= NETIF_F_GSO_PARTIAL;
+
+	dev->hw_enc_features = dev->hw_features;
+	dev->gso_partial_features = UDP_TUNNEL_PARTIAL_FEATURES;
+	dev->mangleid_features = NETIF_F_GSO_PARTIAL;
 
 	dev->pcpu_stat_type = NETDEV_PCPU_STAT_DSTATS;
 	/* MTU range: 68 - (something less than 65535) */
···
 	[IFLA_GENEVE_DF] = { .type = NLA_U8 },
 	[IFLA_GENEVE_INNER_PROTO_INHERIT] = { .type = NLA_FLAG },
 	[IFLA_GENEVE_PORT_RANGE] = NLA_POLICY_EXACT_LEN(sizeof(struct ifla_geneve_port_range)),
+	[IFLA_GENEVE_GRO_HINT] = { .type = NLA_FLAG },
 };
 
 static int geneve_validate(struct nlattr *tb[], struct nlattr *data[],
···
 		cfg->inner_proto_inherit = true;
 	}
 
+	if (data[IFLA_GENEVE_GRO_HINT]) {
+		if (changelink) {
+			attrtype = IFLA_GENEVE_GRO_HINT;
+			goto change_notsup;
+		}
+		cfg->gro_hint = true;
+	}
+
 	return 0;
 change_notsup:
 	NL_SET_ERR_MSG_ATTR(extack, data[attrtype],
-			    "Changing VNI, Port, endpoint IP address family, external, inner_proto_inherit, and UDP checksum attributes are not supported");
+			    "Changing VNI, Port, endpoint IP address family, external, inner_proto_inherit, gro_hint and UDP checksum attributes are not supported");
 	return -EOPNOTSUPP;
 }
 
···
 		nla_total_size(sizeof(__u8)) + /* IFLA_GENEVE_TTL_INHERIT */
 		nla_total_size(0) + /* IFLA_GENEVE_INNER_PROTO_INHERIT */
 		nla_total_size(sizeof(struct ifla_geneve_port_range)) + /* IFLA_GENEVE_PORT_RANGE */
+		nla_total_size(0) + /* IFLA_GENEVE_GRO_HINT */
 		0;
 }
 
···
 		goto nla_put_failure;
 
 	if (nla_put(skb, IFLA_GENEVE_PORT_RANGE, sizeof(ports), &ports))
+		goto nla_put_failure;
+
+	if (geneve->cfg.gro_hint &&
+	    nla_put_flag(skb, IFLA_GENEVE_GRO_HINT))
 		goto nla_put_failure;
 
 	return 0;
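
The sanity checks that `geneve_opt_gro_hint_off()` applies to a hint boil down to three offset comparisons. The sketch below restates them in user-space C under stated assumptions: the struct `gro_hint_sketch`, the function `gro_hint_sane()` and the fixed header-size constants are hypothetical stand-ins for the kernel types, and the bit-field is replaced by plain members for portability.

```c
#include <stdbool.h>
#include <stdint.h>

#define UDP_HLEN	8	/* sizeof(struct udphdr) */
#define IPV4_HLEN	20	/* sizeof(struct iphdr), no options */
#define IPV6_HLEN	40	/* sizeof(struct ipv6hdr) */
#define NUM_PROTO_IDS	3	/* TEB, IPv6, IPv4 in the series' proto_id_map */

/* Hypothetical, simplified view of the 4-byte hint payload. */
struct gro_hint_sketch {
	uint8_t inner_proto_id;
	bool nested_is_v6;
	uint8_t nested_nh_offset;	/* nested network header offset */
	uint8_t nested_tp_offset;	/* nested transport header offset */
	uint8_t nested_hdr_len;		/* total nested header length */
};

static bool gro_hint_sane(const struct gro_hint_sketch *h)
{
	unsigned int nh_len = h->nested_is_v6 ? IPV6_HLEN : IPV4_HLEN;

	/* The nested UDP header must fit inside the hinted headers. */
	if (h->nested_tp_offset + UDP_HLEN > h->nested_hdr_len)
		return false;

	/* The nested IP header must end at or before the transport one. */
	if (h->nested_nh_offset + nh_len > h->nested_tp_offset)
		return false;

	/* Only the known inner protocol ids are accepted. */
	return h->inner_proto_id < NUM_PROTO_IDS;
}
```

For example, a nested Ethernet (14) + IPv4 (20) + UDP (8) + base Geneve (8) stack gives `nested_nh_offset = 14`, `nested_tp_offset = 34` and `nested_hdr_len = 50`, which passes all three checks.
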
+14 -4
drivers/net/vxlan/vxlan_core.c
···
 			   struct vxlan_metadata *md, u32 vxflags,
 			   bool udp_sum)
 {
-	struct vxlanhdr *vxh;
-	int min_headroom;
-	int err;
 	int type = udp_sum ? SKB_GSO_UDP_TUNNEL_CSUM : SKB_GSO_UDP_TUNNEL;
 	__be16 inner_protocol = htons(ETH_P_TEB);
+	struct vxlanhdr *vxh;
+	bool double_encap;
+	int min_headroom;
+	int err;
 
 	if ((vxflags & VXLAN_F_REMCSUM_TX) &&
 	    skb->ip_summed == CHECKSUM_PARTIAL) {
···
 	if (unlikely(err))
 		return err;
 
+	double_encap = udp_tunnel_handle_partial(skb);
 	err = iptunnel_handle_offloads(skb, type);
 	if (err)
 		return err;
···
 		inner_protocol = skb->protocol;
 	}
 
-	skb_set_inner_protocol(skb, inner_protocol);
+	udp_tunnel_set_inner_protocol(skb, double_encap, inner_protocol);
 	return 0;
 }
 
···
 	dev->features |= NETIF_F_RXCSUM;
 	dev->features |= NETIF_F_GSO_SOFTWARE;
 
+	/* Partial features are disabled by default. */
 	dev->vlan_features = dev->features;
 	dev->hw_features |= NETIF_F_SG | NETIF_F_HW_CSUM | NETIF_F_FRAGLIST;
 	dev->hw_features |= NETIF_F_RXCSUM;
 	dev->hw_features |= NETIF_F_GSO_SOFTWARE;
+	dev->hw_features |= UDP_TUNNEL_PARTIAL_FEATURES;
+	dev->hw_features |= NETIF_F_GSO_PARTIAL;
+
+	dev->hw_enc_features = dev->hw_features;
+	dev->gso_partial_features = UDP_TUNNEL_PARTIAL_FEATURES;
+	dev->mangleid_features = NETIF_F_GSO_PARTIAL;
+
 	netif_keep_dst(dev);
 	dev->priv_flags |= IFF_NO_QUEUE;
 	dev->change_proto_down = true;
+3
include/linux/netdevice.h
···
  *
  *	@mpls_features:	Mask of features inheritable by MPLS
  *	@gso_partial_features: value(s) from NETIF_F_GSO\*
+ *	@mangleid_features:	Mask of features requiring MANGLEID, will be
+ *				disabled together with the latter.
  *
  *	@ifindex:	interface index
  *	@group:		The group the device belongs to
···
 	netdev_features_t	vlan_features;
 	netdev_features_t	hw_enc_features;
 	netdev_features_t	mpls_features;
+	netdev_features_t	mangleid_features;
 
 	unsigned int		min_mtu;
 	unsigned int		max_mtu;
+32
include/net/udp_tunnel.h
···
 #include <net/ipv6_stubs.h>
 #endif
 
+#define UDP_TUNNEL_PARTIAL_FEATURES NETIF_F_GSO_ENCAP_ALL
+#define UDP_TUNNEL_STRIPPED_GSO_TYPES ((UDP_TUNNEL_PARTIAL_FEATURES | \
+					NETIF_F_GSO_PARTIAL) >> \
+				       NETIF_F_GSO_SHIFT)
+
 struct udp_port_cfg {
 	u8 family;
 
···
 			 __u8 prio, __u8 ttl, __be32 label,
 			 __be16 src_port, __be16 dst_port, bool nocheck,
 			 u16 ip6cb_flags);
+
+static inline bool udp_tunnel_handle_partial(struct sk_buff *skb)
+{
+	bool double_encap = !!(skb_shinfo(skb)->gso_type & SKB_GSO_PARTIAL);
+
+	/*
+	 * If the skb went through partial segmentation, lower devices
+	 * will not need to offload the related features - except for
+	 * UDP_TUNNEL, that will be re-added by the later
+	 * udp_tunnel_handle_offloads().
+	 */
+	if (double_encap)
+		skb_shinfo(skb)->gso_type &= ~UDP_TUNNEL_STRIPPED_GSO_TYPES;
+	return double_encap;
+}
+
+static inline void udp_tunnel_set_inner_protocol(struct sk_buff *skb,
+						 bool double_encap,
+						 __be16 inner_proto)
+{
+	/*
+	 * The inner protocol has been set by the nested tunnel, don't
+	 * override it.
+	 */
+	if (!double_encap)
+		skb_set_inner_protocol(skb, inner_proto);
+}
 
 void udp_tunnel_sock_release(struct socket *sock);
 
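The flag handling in `udp_tunnel_handle_partial()` can be modelled in a few lines of bash, the language of the selftest in this series. This is only a sketch: the bit positions are made up for illustration and are not the kernel's `SKB_GSO_*` values, `STRIPPED` is a reduced stand-in for `UDP_TUNNEL_STRIPPED_GSO_TYPES`, and `handle_partial` is a hypothetical name.

```shell
#!/bin/bash
# Illustrative bit positions only: NOT the kernel's SKB_GSO_* values.
GSO_TCPV4=$((1 << 0))
GSO_GRE=$((1 << 1))
GSO_UDP_TUNNEL=$((1 << 2))
GSO_UDP_TUNNEL_CSUM=$((1 << 3))
GSO_PARTIAL=$((1 << 4))

# Reduced stand-in for UDP_TUNNEL_STRIPPED_GSO_TYPES:
# a few encapsulation types plus the PARTIAL flag.
STRIPPED=$((GSO_GRE | GSO_UDP_TUNNEL | GSO_UDP_TUNNEL_CSUM | GSO_PARTIAL))

# Hypothetical model of udp_tunnel_handle_partial().
# Prints "<double_encap> <resulting gso_type>".
handle_partial() {
	local gso_type=$1
	local double_encap=0

	if (( gso_type & GSO_PARTIAL )); then
		# The skb already went through partial segmentation:
		# drop the encap-related types so that lower devices
		# are not asked to offload them again.
		double_encap=1
		gso_type=$(( gso_type & ~STRIPPED ))
	fi
	echo "$double_encap $gso_type"
}

# Nested tunnel already marked the skb as GSO partial.
handle_partial $((GSO_TCPV4 | GSO_UDP_TUNNEL | GSO_PARTIAL))
# Plain TCP skb: nothing is stripped.
handle_partial "$GSO_TCPV4"
```

In the real helpers the returned flag is then passed to `udp_tunnel_set_inner_protocol()`, which leaves the inner protocol untouched when it is true.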
+1
include/uapi/linux/if_link.h
···
 	IFLA_GENEVE_DF,
 	IFLA_GENEVE_INNER_PROTO_INHERIT,
 	IFLA_GENEVE_PORT_RANGE,
+	IFLA_GENEVE_GRO_HINT,
 	__IFLA_GENEVE_MAX
 };
 #define IFLA_GENEVE_MAX (__IFLA_GENEVE_MAX - 1)
+4 -1
net/core/dev.c
···
 				    inner_ip_hdr(skb) : ip_hdr(skb);
 
 		if (!(iph->frag_off & htons(IP_DF)))
-			features &= ~NETIF_F_TSO_MANGLEID;
+			features &= ~dev->mangleid_features;
 	}
 
 	/* NETIF_F_IPV6_CSUM does not support IPv6 extension headers,
···
 		dev->mpls_features |= NETIF_F_TSO_MANGLEID;
 	if (dev->hw_enc_features & NETIF_F_TSO)
 		dev->hw_enc_features |= NETIF_F_TSO_MANGLEID;
+
+	/* TSO_MANGLEID belongs in mangleid_features by definition */
+	dev->mangleid_features |= NETIF_F_TSO_MANGLEID;
 
 	/* Make NETIF_F_HIGHDMA inheritable to VLAN devices.
 	 */
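The effect of replacing the fixed `NETIF_F_TSO_MANGLEID` mask with the per-device `mangleid_features` mask can be sketched in bash. The bit positions below are illustrative only (not the kernel's `NETIF_F_*` values) and `clear_mangleid` is a hypothetical helper; it models just the changed line, not the surrounding conditions of the features check.

```shell
#!/bin/bash
# Illustrative bit positions only: NOT the kernel's NETIF_F_* values.
F_TSO=$((1 << 0))
F_TSO_MANGLEID=$((1 << 1))
F_GSO_PARTIAL=$((1 << 2))

# Hypothetical model of the changed line:
#   features &= ~dev->mangleid_features   (only when IP_DF is not set)
# Arguments: <features> <mangleid_features> <df_set: 0|1>
clear_mangleid() {
	local features=$1
	local mangleid=$2
	local df_set=$3

	if (( ! df_set )); then
		features=$(( features & ~mangleid ))
	fi
	echo "$features"
}

# Mask as the vxlan hunk in this series sets it up: the driver adds
# NETIF_F_GSO_PARTIAL, the core adds TSO_MANGLEID at registration.
vx_mangleid=$(( F_GSO_PARTIAL | F_TSO_MANGLEID ))
all=$(( F_TSO | F_TSO_MANGLEID | F_GSO_PARTIAL ))

clear_mangleid "$all" "$vx_mangleid" 1	# DF set: nothing is cleared
clear_mangleid "$all" "$vx_mangleid" 0	# DF clear: only F_TSO is left
```

With the second call GSO partial is dropped together with MANGLEID, which appears to be the situation the "GSO disable due to no fixedid" selftest case exercises.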
+1
tools/testing/selftests/net/Makefile
···
 	cmsg_so_mark.sh \
 	cmsg_so_priority.sh \
 	cmsg_time.sh \
+	double_udp_encap.sh \
 	drop_monitor_tests.sh \
 	fcnal-ipv4.sh \
 	fcnal-ipv6.sh \
+1
tools/testing/selftests/net/config
···
 CONFIG_NETFILTER=y
 CONFIG_NETFILTER_ADVANCED=y
 CONFIG_NETFILTER_XTABLES_LEGACY=y
+CONFIG_NETFILTER_XT_MATCH_BPF=m
 CONFIG_NETFILTER_XT_MATCH_LENGTH=m
 CONFIG_NETFILTER_XT_MATCH_POLICY=m
 CONFIG_NETFILTER_XT_NAT=m
+393
tools/testing/selftests/net/double_udp_encap.sh
···
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+source lib.sh
+
+# shellcheck disable=SC2155 # prefer RO variable over return value from cmd
+readonly CLI="$(dirname "$(readlink -f "$0")")/../../../net/ynl/pyynl/cli.py"
+
+readonly SRC=1
+readonly DST=2
+
+readonly NET_V4=192.168.1.
+readonly NET_V6=2001:db8::
+readonly OL1_NET_V4=172.16.1.
+readonly OL1_NET_V6=2001:db8:1::
+readonly OL2_NET_V4=172.16.2.
+readonly OL2_NET_V6=2001:db8:2::
+
+trap cleanup_all_ns EXIT
+
+# shellcheck disable=SC2329 # can't figure out usage through a variable
+is_ipv6() {
+	if [[ $1 =~ .*:.* ]]; then
+		return 0
+	fi
+	return 1
+}
+
+# shellcheck disable=SC2329 # can't figure out usage through a variable
+create_gnv_endpoint() {
+	local -r netns=$1
+	local -r bm_rem_addr=$2
+	local -r gnv_dev=$3
+	local -r gnv_id=$4
+	local opts=$5
+	local gnv_json
+	local rem
+
+	if is_ipv6 "$bm_rem_addr"; then
+		rem=remote6
+	else
+		rem=remote
+	fi
+
+	# add ynl opt separator, if needed
+	[ -n "$opts" ] && opts=", $opts"
+
+	gnv_json="{ \"id\": $gnv_id, \"$rem\": \"$bm_rem_addr\"$opts }"
+	ip netns exec "$netns" "$CLI" --family rt-link --create --excl \
+		--do newlink --json "{\"ifname\": \"$gnv_dev\",
+				       \"linkinfo\": {\"kind\":\"geneve\",
+				       \"data\": $gnv_json } }" > /dev/null
+	ip -n "$netns" link set dev "$gnv_dev" up
+}
+
+# shellcheck disable=SC2329 # can't figure out usage through a variable
+create_vxlan_endpoint() {
+	local -r netns=$1
+	local -r bm_rem_addr=$2
+	local -r vxlan_dev=$3
+	local -r vxlan_id=$4
+	local -r opts_str=$5
+	local oldifs
+	local -a opts
+	local opt
+
+	# convert the arguments from yaml format
+	oldifs=$IFS
+	IFS=','
+	for opt in $opts_str; do
+		local pattern='"port":'
+
+		[ -n "$opt" ] || continue
+
+		opts+=("${opt/$pattern*/dstport}" "${opt/$pattern/}")
+	done
+	IFS=$oldifs
+	[ ${#opts[@]} -gt 0 ] || opts+=("dstport" "4789")
+
+	ip -n "$netns" link add "$vxlan_dev" type vxlan id "$vxlan_id" \
+		remote "$bm_rem_addr" "${opts[@]}"
+	ip -n "$netns" link set dev "$vxlan_dev" up
+}
+
+create_ns() {
+	local nested_opt='"port":6082'
+	local create_endpoint
+	local options="$1"
+	local feature
+	local dev
+	local id
+	local ns
+
+	RET=0
+
+	# +-------------+        +-------------+
+	# | NS_SRC      |        | NS_NST_DST  |
+	# |             |        |             |
+	# |  gnv_nst1   |        |   gnv_nst2  |
+	# |   +         |        |         +   |
+	# |   |         |        |         |   |
+	# |   +         |        |         +   |
+	# |  gnv1       |        |       gnv2  |
+	# |   +         |        |         +   |
+	# |   |         |        |         |   |
+	# |   + veth1   +--------+   veth2 +   |
+	# |             |        |             |
+	# +-------------+        +-------------+
+
+	setup_ns NS_SRC NS_DST
+
+	# concatenate caller provided options and default one
+	[ -n "$2" ] && nested_opt="$nested_opt,$2"
+
+	ip link add name "veth$SRC" netns "$NS_SRC" type veth \
+		peer name "veth$DST" netns "$NS_DST"
+	case "$ENCAP" in
+	vxlan)
+		create_endpoint=create_vxlan_endpoint
+		dev=vx
+		;;
+	geneve)
+		create_endpoint=create_gnv_endpoint
+		dev=gnv
+		;;
+	esac
+
+	id=1
+	for ns in "${NS_LIST[@]}"; do
+		ip -n "$ns" link set dev "veth$id" up
+
+		# ensure the sender can do large write just after 3whs
+		ip netns exec "$ns" \
+			sysctl -qw net.ipv4.tcp_wmem="4096 4194304 4194304"
+
+		# note that 3 - $SRC == $DST and 3 - $DST == $SRC
+		if [ $FAMILY = "4" ]; then
+			ip -n "$ns" addr add dev "veth$id" "$NET_V4$id/24"
+			$create_endpoint "$ns" "$NET_V4$((3 - id))" \
+				"$dev$id" 4 "$options"
+			ip -n "$ns" addr add dev "$dev$id" "$OL1_NET_V4$id/24"
+
+			# nested tunnel devices
+			# pmtu can't be propagated to upper layer devices;
+			# need manual adjust
+			$create_endpoint "$ns" "$OL1_NET_V4$((3 - id))" \
+				"$dev"_nst"$id" 40 "$nested_opt"
+			ip -n "$ns" addr add dev "$dev"_nst"$id" \
+				"$OL2_NET_V4$id/24"
+			ip -n "$ns" link set dev "$dev"_nst"$id" mtu 1392
+		else
+			ip -n "$ns" addr add dev "veth$id" "$NET_V6$id/64" \
+				nodad
+			$create_endpoint "$ns" "$NET_V6$((3 - id))" \
+				"$dev"6"$id" 6 "$options"
+			ip -n "$ns" addr add dev "$dev"6"$id" \
+				"$OL1_NET_V6$id/64" nodad
+
+			$create_endpoint "$ns" "$OL1_NET_V6$((3 - id))" \
+				"$dev"6_nst"$id" 60 "$nested_opt"
+			ip -n "$ns" addr add dev "$dev"6_nst"$id" \
+				"$OL2_NET_V6$id/64" nodad
+			ip -n "$ns" link set dev "$dev"6_nst"$id" mtu 1352
+		fi
+		id=$((id+1))
+	done
+
+	# enable GRO heuristic on the veth peer and ensure UDP L4 over tunnel is
+	# actually segmented
+	for feature in tso tx-udp_tnl-segmentation; do
+		ip netns exec "$NS_SRC" ethtool -K "veth$SRC" \
+			"$feature" off 2>/dev/null
+	done
+}
+
+create_ns_gso() {
+	local dev
+
+	create_ns "$@"
+	if [ "$ENCAP" = "geneve" ]; then
+		dev=gnv
+	else
+		dev=vx
+	fi
+	[ "$FAMILY" = "6" ] && dev="$dev"6
+	ip netns exec "$NS_SRC" ethtool -K "$dev$SRC" \
+		tx-gso-partial on \
+		tx-udp_tnl-segmentation on \
+		tx-udp_tnl-csum-segmentation on
+}
+
+create_ns_gso_gro() {
+	create_ns_gso "$@"
+	ip netns exec "$NS_DST" ethtool -K "veth$DST" gro on
+	ip netns exec "$NS_SRC" ethtool -K "veth$SRC" tx off >/dev/null 2>&1
+}
+
+run_test() {
+	local -r dst=$NET$DST
+	local -r msg=$1
+	local -r total_size=$2
+	local -r encappkts=$3
+	local inner_proto_offset=0
+	local inner_maclen=14
+	local rx_family="-4"
+	local ipt=iptables
+	local bpf_filter
+	local -a rx_args
+	local wire_pkts
+	local rcvpkts
+	local encl=8
+	local dport
+	local pkts
+	local snd
+
+	if [ $FAMILY = "6" ]; then
+		ipt=ip6tables
+	else
+		# rx program does not support '-6' and implies ipv6 usage by
+		# default
+		rx_args=("$rx_family")
+	fi
+
+	# The receiver can only check fixed size packets
+	pkts=$((total_size / GSO_SIZE))
+	if [ -n "$4" ]; then
+		wire_pkts=$4
+	elif [ $((total_size % GSO_SIZE)) -eq 0 ]; then
+		wire_pkts=1
+		rx_args+=("-l" "$GSO_SIZE")
+	else
+		wire_pkts=2
+		pkts=$((pkts + 1))
+	fi
+
+	if [ "$ENCAP" = "geneve" ]; then
+		dport=6081
+	else
+		dport=4789
+	fi
+
+	# Either:
+	# - IPv4, nested tunnel carries UDP over IPv4, with dport 6082,
+	#   innermost is TCP over IPv4 on port 8000
+	# - IPv6, nested tunnel carries UDP over IPv6, with dport 6082,
+	#   innermost is TCP over IPv6 on port 8000
+	# The nested tunnel port is 6082 and the nested encap len is 8
+	# regardless of the encap type (no geneve opts).
+	# In inherit protocol mode there is no nested mac hdr and the nested
+	# l3 protocol type field belongs to the geneve hdr.
+	[ "$USE_HINT" = true ] && encl=16
+	[ "$INHERIT" = true ] && inner_maclen=0
+	[ "$INHERIT" = true ] && inner_proto_offset=-4
+	local inner=$((inner_maclen+encl))
+	local proto=$((inner_maclen+encl+inner_proto_offset))
+	bpf_filter=$(nfbpf_compile "(ip &&
+		ip[$((40+encl))] == 0x08 && ip[$((41+encl))] == 0x00 &&
+		ip[$((51+encl))] == 0x11 &&
+		ip[$((64+encl))] == 0x17 && ip[$((65+encl))] == 0xc2 &&
+		ip[$((76+proto))] == 0x08 && ip[$((77+proto))] == 0x00 &&
+		ip[$((87+inner))] == 0x6 &&
+		ip[$((100+inner))] == 0x1f && ip[$((101+inner))] == 0x40) ||
+		(ip6 &&
+		ip6[$((60+encl))] == 0x86 && ip6[$((61+encl))] == 0xdd &&
+		ip6[$((68+encl))] == 0x11 &&
+		ip6[$((104+encl))] == 0x17 && ip6[$((105+encl))] == 0xc2 &&
+		ip6[$((116+proto))] == 0x86 && ip6[$((117+proto))] == 0xdd &&
+		ip6[$((124+inner))] == 0x6 &&
+		ip6[$((160+inner))] == 0x1f && ip6[$((161+inner))] == 0x40)")
+
+	# ignore short packets, to avoid arp/mld induced noise
+	ip netns exec "$NS_SRC" "$ipt" -A OUTPUT -p udp --dport "$dport" \
+		-m length --length 600:65535 -m bpf --bytecode "$bpf_filter"
+	ip netns exec "$NS_DST" "$ipt" -A INPUT -p udp --dport "$dport" \
+		-m length --length 600:65535 -m bpf --bytecode "$bpf_filter"
+	ip netns exec "$NS_DST" ./udpgso_bench_rx -C 2000 -t -R 100 \
+		-n "$pkts" "${rx_args[@]}" &
+	local pid=$!
+	wait_local_port_listen "$NS_DST" 8000 tcp
+	ip netns exec "$NS_SRC" ./udpgso_bench_tx -"$FAMILY" -t -M 1 \
+		-s "$total_size" -D "$dst"
+	local ret=$?
+	check_err "$ret" "client failure exit code $ret"
+	wait "$pid"
+	ret=$?
+	check_err "$ret" "server failure exit code $ret"
+
+	snd=$(ip netns exec "$NS_SRC" "$ipt"-save -c |
+	      grep "dport $dport" | sed -e 's/\[//' -e 's/:.*//')
+
+	[ "$snd" = "$wire_pkts" ]
+	# shellcheck disable=SC2319 # known false positive
+	check_err $? "sent $snd packets on the lowest link, expected $wire_pkts"
+
+	rcvpkts=$(ip netns exec "$NS_DST" "$ipt"-save -c | \
+		  grep "dport $dport" | sed -e 's/\[//' -e 's/:.*//')
+
+	[ "$rcvpkts" = "$encappkts" ]
+	check_err $? "received $rcvpkts $ENCAP packets, expected $encappkts"
+	log_test "$msg"
+}
+
+run_tests() {
+	for FAMILY in 4 6; do
+		NET=$OL2_NET_V4
+		GSO_SIZE=1340 # 1392 - 20 - 32
+
+		if [ $FAMILY = 6 ]; then
+			NET=$OL2_NET_V6
+			GSO_SIZE=1280 # 1352 - 40 - 32
+		fi
+
+		echo "IPv$FAMILY"
+
+		unset USE_HINT
+		unset INHERIT
+
+		# "geneve" must be last encap in list, so that later
+		# test cases will run on it
+		for ENCAP in "vxlan" "geneve"; do
+			create_ns
+			run_test "No GSO - $ENCAP" $((GSO_SIZE * 4)) 4 4
+			cleanup_all_ns
+
+			create_ns_gso
+			run_test "GSO without GRO - $ENCAP" $((GSO_SIZE * 4)) \
+				4 1
+			cleanup_all_ns
+
+			# IPv4 only test
+			[ $FAMILY = "4" ] || continue
+			create_ns_gso
+			ip netns exec "$NS_SRC" \
+				sysctl -qw net.ipv4.ip_no_pmtu_disc=1
+			run_test "GSO disable due to no fixedid - $ENCAP" \
+				$((GSO_SIZE * 4)) 4 4
+			cleanup_all_ns
+		done
+
+		# GRO tests imply/require geneve encap, the only one providing
+		# GRO hints
+		create_ns_gso_gro
+		run_test "double tunnel GRO, no hints" $((GSO_SIZE * 4)) 4
+		cleanup_all_ns
+
+		# hint option is expected for all the following tests in the RX
+		# path
+		USE_HINT=true
+		create_ns_gso_gro \
+			'"gro-hint":1,"udp-zero-csum6-tx":1,"udp-zero-csum6-rx":1' \
+			'"udp-zero-csum6-tx":1,"udp-zero-csum6-rx":1'
+		run_test "double tunnel GRO" $((GSO_SIZE * 4)) 1
+		cleanup_all_ns
+
+		create_ns_gso_gro '"gro-hint":1,"udp-csum":1' '"udp-csum":1'
+		run_test "double tunnel GRO - csum complete" $((GSO_SIZE * 4))\
+			1
+		cleanup_all_ns
+
+		create_ns_gso_gro '"gro-hint":1' \
+			'"udp-csum":0,"udp-zero-csum6-tx":1,"udp-zero-csum6-rx":1'
+		run_test "double tunnel GRO - no nested csum" \
+			$((GSO_SIZE * 4)) 1
+		cleanup_all_ns
+
+		create_ns_gso_gro \
+			'"gro-hint":1,"udp-zero-csum6-tx":1,"udp-zero-csum6-rx":1' \
+			'"udp-csum":1'
+		run_test "double tunnel GRO - nested csum, outer 0-csum, skip"\
+			$((GSO_SIZE * 4)) 4
+		cleanup_all_ns
+
+		INHERIT=true
+		create_ns_gso_gro '"gro-hint":1,"udp-csum":1' \
+			'"udp-csum":1,"inner-proto-inherit":1'
+		run_test "double tunnel GRO - nested inherit proto" \
+			$((GSO_SIZE * 4)) 1
+		cleanup_all_ns
+		unset INHERIT
+
+		create_ns_gso_gro '"gro-hint":1'
+		run_test "double tunnel GRO - short last pkt" \
+			$((GSO_SIZE * 4 + GSO_SIZE / 2)) 2
+		cleanup_all_ns
+	done
+}
+
+require_command nfbpf_compile
+require_command jq
+
+# tcp retransmissions will break the accounting
+xfail_on_slow run_tests
+exit "$EXIT_STATUS"
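The packet accounting used by `run_test` can be checked in isolation. The sketch below reproduces the two `GSO_SIZE` comments (`1392 - 20 - 32` and `1352 - 40 - 32`) and the `pkts` / `wire_pkts` branch; `gso_size` and `expected_pkts` are hypothetical helper names, and reading the 32 bytes as a TCP header with options is an assumption, since the script only gives the numbers.

```shell
#!/bin/bash
# Hypothetical helper: payload size per segment on the nested tunnel,
# given its MTU and the IP header length (20 for IPv4, 40 for IPv6).
# The constant 32 comes from the script's comments; treating it as the
# TCP header plus options is an assumption.
gso_size() {
	local mtu=$1
	local iphdr=$2

	echo $((mtu - iphdr - 32))
}

# Hypothetical helper mirroring the branch at the top of run_test.
# Arguments: <total_size> <gso_size> [wire_pkts override]
# Prints "<pkts expected by the receiver> <pkts expected on the wire>".
expected_pkts() {
	local total=$1
	local gso=$2
	local pkts=$((total / gso))
	local wire

	if [ -n "$3" ]; then
		wire=$3
	elif [ $((total % gso)) -eq 0 ]; then
		# a single aggregate carries every full-size segment
		wire=1
	else
		# full-size aggregate plus one short trailing packet
		wire=2
		pkts=$((pkts + 1))
	fi
	echo "$pkts $wire"
}

gso_size 1392 20			# IPv4 nested tunnel
gso_size 1352 40			# IPv6 nested tunnel
expected_pkts $((1340 * 4)) 1340	# "GSO without GRO" style case
expected_pkts $((1340 * 4 + 670)) 1340	# "short last pkt" style case
expected_pkts $((1340 * 4)) 1340 4	# "No GSO": wire count passed in
```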