author: David Timber <dxdt@dev.snart.me> 2024-07-09 01:09:36 +0200
committer: David Timber <dxdt@dev.snart.me> 2024-07-09 01:19:45 +0200
commit: ef5b99386d8c021d9cba38a7a615cdbfcbe80477
tree: 069d31612da0232a3d294cc450629caab282a8d3 /writeups
parent: a362db911c58de43e0a2c6c88aaf9f15a9591025
Add writeups/test-ipv6.com
Diffstat (limited to 'writeups')
-rw-r--r-- writeups/test-ipv6.com/broken_mtu1280.test-ipv6.com.md | 87
-rw-r--r-- writeups/test-ipv6.com/image.png | bin 0 -> 55866 bytes
-rw-r--r-- writeups/test-ipv6.com/ss_aws-instance.png | bin 0 -> 322028 bytes
3 files changed, 87 insertions, 0 deletions
diff --git a/writeups/test-ipv6.com/broken_mtu1280.test-ipv6.com.md b/writeups/test-ipv6.com/broken_mtu1280.test-ipv6.com.md
new file mode 100644
index 0000000..fe4bd17
--- /dev/null
+++ b/writeups/test-ipv6.com/broken_mtu1280.test-ipv6.com.md
@@ -0,0 +1,87 @@
+# False PMTUD result due to mtu1280.test-ipv6.com not set up correctly
+I came across a strange problem where the **test-ipv6.com** test fails even
+though IPv6 connectivity works just fine. It turns out there was nothing wrong
+with the connectivity provided by Comviq, a low-cost MVNO of Tele2. In summary,
+the false result on this particular network is due to the following factors.
+
+- mtu1280.test-ipv6.com is not set up correctly. The ACKs for the large
+ packets still make it to the client
+- Out-of-order delivery of IP packets over Tele2's mobile network
+- The strict validation of ICMPv6 packets in the Linux kernel
+
+## mtu1280.test-ipv6.com ACKs large packets
+It seems that the Packet Too Big message is "soft-simulated" on the test host
+itself rather than on a node sitting between the test host and the internet:
+the ICMPv6 packet originates from the test host's own address, and the large
+packet that is supposed to be undeliverable to the host is apparently delivered
+and ACK'd.
+
+An ICMPv6 Packet Too Big message whose source address is the destination host
+itself does not make sense, because both endpoints exchange their MSS during
+the SYN handshake using TCP options. Such messages are usually sent by routers
+in old/exotic/tunneled L2 segments.
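
To make the MSS argument concrete, here is a small arithmetic sketch. It assumes a 40-byte fixed IPv6 header with no extension headers and a 20-byte TCP header with no options; the helper name `ipv6_tcp_mss` is made up for illustration and is not taken from the test site or any kernel.

```python
IPV6_HEADER_LEN = 40  # fixed IPv6 header; no extension headers assumed
TCP_HEADER_LEN = 20   # TCP header without options assumed


def ipv6_tcp_mss(mtu: int) -> int:
    """Largest TCP payload that fits in one IPv6 packet of the given MTU."""
    return mtu - IPV6_HEADER_LEN - TCP_HEADER_LEN


if __name__ == "__main__":
    for mtu in (1500, 1280):
        print(f"MTU {mtu} -> MSS {ipv6_tcp_mss(mtu)}")
```

Under these assumptions, a host whose own link MTU really were 1280 would be expected to advertise an MSS of about 1220 in its SYN, so its peer would not send it a larger segment in the first place.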
+
+![Wireshark on AWS instance](https://raw.githubusercontent.com/dxdxdt/gists/master/writeups/test-ipv6.com/ss_aws-instance.png)
+
+Under "normal" conditions, the ICMP packet reaches the client before the ACK
+packet, so the kernel is able to validate the ICMP message and cache the PMTU.
+The retransmission is performed and the ACK is effectively ignored and counted
+as a duplicate ACK (on BSD systems).
+
+![Wireshark on Comviq network](https://raw.githubusercontent.com/dxdxdt/gists/master/writeups/test-ipv6.com/image.png)
+
+If the ACK packet arrives first, the subsequent ICMP message can no longer be
+mapped to the TCP socket because the sequence numbers have already been updated
+when the ACK was processed. On all modern OSes, nothing is done for that
+connection. Although the original TCP connection is left stalled, most OSes
+still honor the "invalid" ICMP message and cache the new PMTU, so subsequent TCP
+connections to the host use it while the cached entry is valid (10 minutes).
+
+## Different results on various OSes
+- Windows and FreeBSD (iOS and macOS): the ICMP message is honored "globally" -
+ subsequent TCP connections will use the cached PMTU, giving a positive result
+- Linux: ICMP messages not mapped to any active socket are ignored
+
+In the function `tcp_v6_err()` in the Linux kernel's `net/ipv6/tcp_ipv6.c`:
+https://github.com/torvalds/linux/blob/4376e966ecb78c520b0faf239d118ecfab42a119/net/ipv6/tcp_ipv6.c#L436C1-L436C35
+
+```c
+	if (sk->sk_state != TCP_LISTEN &&
+	    !between(seq, snd_una, tp->snd_nxt)) {
+		__NET_INC_STATS(net, LINUX_MIB_OUTOFWINDOWICMPS);
+		goto out;
+	}
+```
+
+Once the ACK has been processed, `snd_una` and `tp->snd_nxt` hold the same
+value because nothing was sent to the server after the initial large packet.
+`seq` is extracted from the TCP header in the ICMP message payload. It no longer
+falls into the range because `snd_una` was advanced by the ACK packet.
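
The effect of the arrival order on this check can be sketched in a few lines of Python. This is a toy model, not kernel code: `between()` is re-implemented here on the assumption that the kernel helper is an unsigned 32-bit range comparison, and the sequence numbers and segment length are hypothetical.

```python
U32_MASK = 0xFFFFFFFF


def between(seq1: int, seq2: int, seq3: int) -> bool:
    """True if seq2 <= seq1 <= seq3 in modulo-2**32 sequence space."""
    return ((seq3 - seq2) & U32_MASK) >= ((seq1 - seq2) & U32_MASK)


def icmp_accepted(seq: int, snd_una: int, snd_nxt: int) -> bool:
    """Model of the quoted check for a socket that is not listening."""
    return between(seq, snd_una, snd_nxt)


# Hypothetical numbers: one 1400-byte segment sent starting at sequence 1000.
SEG_SEQ, SEG_LEN = 1000, 1400

# ICMPv6 error arrives first: the segment is still unacknowledged.
icmp_first = icmp_accepted(SEG_SEQ, snd_una=SEG_SEQ, snd_nxt=SEG_SEQ + SEG_LEN)

# ACK arrives first: snd_una has caught up with snd_nxt before the error shows up.
ack_first = icmp_accepted(
    SEG_SEQ, snd_una=SEG_SEQ + SEG_LEN, snd_nxt=SEG_SEQ + SEG_LEN
)

if __name__ == "__main__":
    print("ICMP first, accepted:", icmp_first)
    print("ACK first, accepted:", ack_first)
```

In this model the error is accepted when it arrives before the ACK and rejected when it arrives after, which matches the out-of-window drop described above.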
+
+The `LINUX_MIB_OUTOFWINDOWICMPS` counter can be checked using the command:
+```sh
+netstat -ts | grep -i icmp
+```
+
+Output:
+```
+IcmpMsg:
+ 5 ICMP packets dropped because they were out-of-window
+```
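
The same counter can probably also be read without `netstat`. The sketch below parses text in the header-line/value-line layout used by `/proc/net/netstat`; the assumption that the counter appears there as `OutOfWindowIcmps` in the `TcpExt` group is mine, and the sample text is a trimmed, made-up excerpt rather than real output.

```python
from typing import Optional

# Trimmed, illustrative sample in the /proc/net/netstat layout (assumed).
SAMPLE = (
    "TcpExt: SyncookiesSent OutOfWindowIcmps LockDroppedIcmps\n"
    "TcpExt: 0 5 0\n"
    "IpExt: InNoRoutes InTruncatedPkts\n"
    "IpExt: 0 0\n"
)


def read_counter(netstat_text: str, group: str, name: str) -> Optional[int]:
    """Return the value of `name` in `group`, or None if it is not present."""
    lines = netstat_text.splitlines()
    # Lines come in pairs: a line of field names, then a line of values.
    for header, values in zip(lines[::2], lines[1::2]):
        names = header.split()
        numbers = values.split()
        if names and names[0] == group + ":" and name in names:
            return int(numbers[names.index(name)])
    return None


if __name__ == "__main__":
    # On a Linux host, pass open("/proc/net/netstat").read() instead of SAMPLE.
    print(read_counter(SAMPLE, "TcpExt", "OutOfWindowIcmps"))
```

Comparing the value before and after running the test page should show whether the ICMPv6 message was dropped by this check.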
+
+## WAN optimization of Tele2
+At first, unaware that the offending large packet was being ACK'd by the test
+host, I focused on the fact that the window size of the TCP packet returned in
+the ICMPv6 message was somehow altered by the ISP's nodes. I confirmed this by
+running the same test on an AWS instance: AWS does not seem to alter the
+contents of TCP packets in any way. After a brief search, I found that some
+WAN optimization and QoS implementations alter the window size value of TCP
+packets. Again, this had nothing to do with the problem: the Linux kernel does
+not care about the window value when mapping the raw packets to sockets.
+
+The out-of-order delivery probably comes from a multipath setup between the ISP
+nodes (round-robin L2 bonding, routers with the same cost, ...) and/or active
+queue management. An ACK packet with no payload is smaller than an ICMPv6
+message with payload, so it has a higher chance of being processed first. Not
+ideal (it forces endpoints to use more memory for reordering), but modern
+systems are implemented to work under these conditions.
diff --git a/writeups/test-ipv6.com/image.png b/writeups/test-ipv6.com/image.png
new file mode 100644
index 0000000..1b712be
--- /dev/null
+++ b/writeups/test-ipv6.com/image.png
Binary files differ
diff --git a/writeups/test-ipv6.com/ss_aws-instance.png b/writeups/test-ipv6.com/ss_aws-instance.png
new file mode 100644
index 0000000..0d415e9
--- /dev/null
+++ b/writeups/test-ipv6.com/ss_aws-instance.png
Binary files differ