# False PMTUD result due to mtu1280.test-ipv6.com not being set up correctly
I came across a strange problem where the **test-ipv6.com** test fails even
though IPv6 connectivity works just fine. It turns out there was nothing wrong
with the connectivity provided by Comviq, a low-cost MVNO of Tele2. In summary,
the false result on this particular network is due to the following factors:
- mtu1280.test-ipv6.com is not set up correctly: the ACK for the large packet
still makes it to the client
- Out-of-order delivery of IP packets over Tele2's mobile network
- The strict validation of ICMPv6 packets in the Linux kernel
## mtu1280.test-ipv6.com ACKs large packets
It seems that the Packet Too Big message is "soft-simulated" on the test host
itself rather than on a node that sits between the test host and the internet:
the ICMP packet originates from the host's own address, and the large packet
that is supposed to be undeliverable to the host is apparently delivered and
ACK'd.

An ICMP message whose source address is the destination host itself does not
make sense, because both endpoints exchange their MSS in the TCP options during
the SYN handshake. It is usually routers in old, exotic, or tunneled L2
segments that send such packets.
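
The relationship between a link MTU and the MSS advertised during the handshake can be sketched with a little arithmetic. This is only an illustration, assuming a plain 40-byte IPv6 header and a 20-byte TCP header with no extension headers or options; the function name `ipv6_tcp_mss` is made up for this example.

```c
#include <stdint.h>

/* Fixed header sizes in bytes (assumes no IPv6 extension headers
 * and no TCP options). */
#define IPV6_HEADER_LEN 40u
#define TCP_HEADER_LEN  20u

/* Largest TCP payload that fits in a single IPv6 packet of size `mtu`.
 * Returns 0 if the MTU cannot even hold the two headers. */
uint32_t ipv6_tcp_mss(uint32_t mtu)
{
    uint32_t overhead = IPV6_HEADER_LEN + TCP_HEADER_LEN;

    return mtu > overhead ? mtu - overhead : 0;
}
```

Under these assumptions, the 1280-byte MTU that the test host simulates corresponds to an MSS of 1220 bytes, while a 1500-byte Ethernet MTU corresponds to 1440 bytes.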

Under "normal" conditions, the ICMP packet reaches the client before the ACK,
so the kernel is able to validate the ICMP message and cache the PMTU. The
retransmission is performed, and the ACK is effectively ignored and counted as
a duplicate ACK (on BSD systems).

If the ACK arrives first, the subsequent ICMP message can no longer be mapped
to the TCP socket, because the sequence numbers were updated when the ACK was
processed. On all modern OSes, nothing is done for that connection. Although
the original TCP connection is left stalled, most OSes honor the "invalid" ICMP
message and cache the new PMTU, so subsequent TCP connections to the host use
it as long as the cached entry is valid (10 minutes).
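
The "global" caching behavior described above can be pictured as a per-destination entry with a lifetime. The following is a rough sketch, not any OS's actual implementation: the struct, the function names, and the 1500-byte default MTU are assumptions for illustration, and the clock is passed in by the caller to keep the example deterministic. Only the 10-minute lifetime comes from the text above.

```c
#include <stdbool.h>
#include <stdint.h>

#define PMTU_LIFETIME_SEC (10u * 60u) /* 10 minutes, as stated above */
#define DEFAULT_MTU       1500u       /* assumed link MTU */

/* Hypothetical cache entry for a single destination host. */
struct pmtu_entry {
    bool     valid;
    uint32_t mtu;
    uint64_t learned_at; /* seconds, from a caller-supplied clock */
};

/* Record the MTU reported by a Packet Too Big message. */
void pmtu_learn(struct pmtu_entry *e, uint32_t mtu, uint64_t now)
{
    e->valid = true;
    e->mtu = mtu;
    e->learned_at = now;
}

/* MTU to use for a new connection: the cached value while it is
 * fresh, otherwise the default. */
uint32_t pmtu_lookup(const struct pmtu_entry *e, uint64_t now)
{
    if (e->valid && now - e->learned_at < PMTU_LIFETIME_SEC)
        return e->mtu;
    return DEFAULT_MTU;
}
```

A connection opened within the lifetime starts with the smaller MTU and so never sends the oversized packet; once the entry expires, the next connection runs into the same situation again.
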
## Different results on various OSes
- Windows and FreeBSD (also iOS and macOS): the ICMP message is honored
"globally"; subsequent TCP connections use the cached PMTU, giving a positive
result
- Linux: ICMP messages not mapped to any active socket are ignored

In the function `tcp_v6_err()` from the Linux kernel's `net/ipv6/tcp_ipv6.c`:
https://github.com/torvalds/linux/blob/4376e966ecb78c520b0faf239d118ecfab42a119/net/ipv6/tcp_ipv6.c#L436C1-L436C35
```c
if (sk->sk_state != TCP_LISTEN &&
    !between(seq, snd_una, tp->snd_nxt)) {
        __NET_INC_STATS(net, LINUX_MIB_OUTOFWINDOWICMPS);
        goto out;
}
```
`snd_una` and `tp->snd_nxt` should be the same value because no packet has been
sent to the server after the initial large packet. `seq` is extracted from the
TCP header quoted in the ICMP message payload. It does not fall within the range
because the values have already been advanced by the ACK.
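
The effect of packet ordering on this check can be reproduced in isolation. The sketch below re-implements the wraparound-safe range test in the style of the kernel's `between()` helper (written from memory, so treat it as an approximation rather than the kernel's exact code); `icmp_in_window` and the sequence numbers used afterwards are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if seq2 <= seq1 <= seq3 in modulo-2^32 sequence space.
 * Unsigned subtraction makes the comparison wraparound-safe. */
bool seq_between(uint32_t seq1, uint32_t seq2, uint32_t seq3)
{
    return (uint32_t)(seq3 - seq2) >= (uint32_t)(seq1 - seq2);
}

/* Would an ICMP error quoting `quoted_seq` pass the window check
 * for a socket with the given snd_una and snd_nxt? */
bool icmp_in_window(uint32_t quoted_seq, uint32_t snd_una, uint32_t snd_nxt)
{
    return seq_between(quoted_seq, snd_una, snd_nxt);
}
```

Suppose the large segment starts at sequence number 1000 and carries 1400 bytes. If the ICMP message arrives first, `snd_una` is still 1000 and `snd_nxt` is 2400, so `icmp_in_window(1000, 1000, 2400)` is true. If the ACK arrives first, both values are 2400, and `icmp_in_window(1000, 2400, 2400)` is false, which is the case that increments the out-of-window counter.
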
The `LINUX_MIB_OUTOFWINDOWICMPS` counter can be checked using the command:
```sh
netstat -ts | grep -i icmp
```
Output:
```
IcmpMsg:
5 ICMP packets dropped because they were out-of-window
```
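
The same counter can also be read programmatically. The sketch below assumes that it is exposed in `/proc/net/netstat` as a pair of lines (a header line of field names followed by a line of values in the same column order) and that the field is named `OutOfWindowIcmps`; both are assumptions to verify on your own system. The helper name `counter_value` is made up, and it works on two already-read lines so it does not depend on the file being present.

```c
#include <stdlib.h>
#include <string.h>

/* Given a header line of space-separated field names and a values line
 * with the same column order, return the value in the column called
 * `name`, or -1 if there is no such column. */
long counter_value(const char *header, const char *values, const char *name)
{
    size_t name_len = strlen(name);

    while (*header && *values) {
        size_t hlen = strcspn(header, " \n"); /* current field name */
        size_t vlen = strcspn(values, " \n"); /* current value */

        if (hlen == name_len && strncmp(header, name, hlen) == 0)
            return strtol(values, NULL, 10);

        /* Advance both lines to the next column. */
        header += hlen;
        values += vlen;
        header += strspn(header, " \n");
        values += strspn(values, " \n");
    }
    return -1;
}
```

With a made-up, shortened pair of lines such as `"TcpExt: SyncookiesSent OutOfWindowIcmps LockDroppedIcmps"` and `"TcpExt: 0 5 0"`, looking up `OutOfWindowIcmps` returns 5.
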
## WAN optimization of Tele2
At first, unaware that the offending large packet was ACK'd by the test host, I
focused on the fact that the window size of the TCP packet returned in the
ICMPv6 message had somehow been altered by the ISP's nodes. I confirmed this by
running the same test on an AWS instance, where the contents of the TCP packets
do not appear to be altered in any way. After a brief search, I found that some
WAN optimization and QoS implementations alter the window size value of TCP
packets. Again, this had nothing to do with the problem: the Linux kernel does
not care about the window value when mapping the raw packets to sockets.

The out-of-order issue probably comes from a multipath setup between the ISP's
nodes (round-robin L2 bonding, routers with the same cost, ...) and/or active
queue management. An ACK packet with no payload is smaller than an ICMPv6
message with a payload, so it has a higher chance of being processed first.
This is not ideal (it forces endpoints to use more memory for reordering), but
modern systems are implemented to work under these conditions.