Alpine 镜像可以带来很小的镜像体积,所以大家比较热衷于使用这个镜像做基础镜像。但是实际上因为一些系统库是阉割版本,可能会导致一些意想不到的问题。例如这里想说的 DNS 的问题。
有一个同事报告说在他们的应用的 container 里面使用 curl 请求一个网址的时候,有一定概率会收到 Could not resolve host:
这个错误,但是也有一定概率会成功返回 200。同时在同一个 pod 的一个 sidecar container 里面使用 curl 访问相同的网址就没问题,总是会返回 200。
这个问题立刻就吸引了我的兴趣,这也太奇怪了。我立刻去他的 container 里面做了一些测试,发现 curl 命令会返回如下三种结果。
Couldn’t connect to server
Could not resolve host
- http code 200.
前面两种是错误,最后一个是期望的结果。这看着显然和 DNS 结果有关系,DNS 包太简单了,那就先从抓包搞起。
这个是在应用 container 里面请求的情况,可以看到确实没有返回任何的 A 记录。那这样得到 2 这个错误也正常了。
07:28:24.782832 IP ip-10-42-25-142.us-east-2.compute.internal.59668 > ip-172-21-0-2.us-east-2.compute.internal.domain: 40110+ A? 5df09v6.v.testdomain.net.sandbox.svc.cluster.local. (63)
07:28:24.783056 IP ip-10-42-25-142.us-east-2.compute.internal.59668 > ip-172-21-0-2.us-east-2.compute.internal.domain: 40740+ AAAA? 5df09v6.v.testdomain.net.sandbox.svc.cluster.local. (63)
07:28:24.795773 IP ip-172-21-0-2.us-east-2.compute.internal.domain > ip-10-42-25-142.us-east-2.compute.internal.59668: 40740 NXDomain 0/1/0 (138)
07:28:24.795953 IP ip-172-21-0-2.us-east-2.compute.internal.domain > ip-10-42-25-142.us-east-2.compute.internal.59668: 40110 NXDomain 0/1/0 (138)
07:28:24.796366 IP ip-10-42-25-142.us-east-2.compute.internal.51888 > ip-172-21-0-2.us-east-2.compute.internal.domain: 25695+ A? 5df09v6.v.testdomain.net.svc.cluster.local. (55)
07:28:24.796560 IP ip-10-42-25-142.us-east-2.compute.internal.51888 > ip-172-21-0-2.us-east-2.compute.internal.domain: 26285+ AAAA? 5df09v6.v.testdomain.net.svc.cluster.local. (55)
07:28:24.808936 IP ip-172-21-0-2.us-east-2.compute.internal.domain > ip-10-42-25-142.us-east-2.compute.internal.51888: 25695 NXDomain 0/1/0 (130)
07:28:24.809593 IP ip-172-21-0-2.us-east-2.compute.internal.domain > ip-10-42-25-142.us-east-2.compute.internal.51888: 26285 NXDomain 0/1/0 (130)
07:28:24.809682 IP ip-10-42-25-142.us-east-2.compute.internal.43867 > ip-172-21-0-2.us-east-2.compute.internal.domain: 27249+ A? 5df09v6.v.testdomain.net.cluster.local. (51)
07:28:24.809766 IP ip-10-42-25-142.us-east-2.compute.internal.43867 > ip-172-21-0-2.us-east-2.compute.internal.domain: 27859+ AAAA? 5df09v6.v.testdomain.net.cluster.local. (51)
07:28:24.822648 IP ip-172-21-0-2.us-east-2.compute.internal.domain > ip-10-42-25-142.us-east-2.compute.internal.43867: 27859 NXDomain 0/1/0 (126)
07:28:24.822700 IP ip-172-21-0-2.us-east-2.compute.internal.domain > ip-10-42-25-142.us-east-2.compute.internal.43867: 27249 NXDomain 0/1/0 (126)
07:28:24.822871 IP ip-10-42-25-142.us-east-2.compute.internal.59387 > ip-172-21-0-2.us-east-2.compute.internal.domain: 51131+ A? 5df09v6.v.testdomain.net.us-east-2.compute.internal. (64)
07:28:24.822985 IP ip-10-42-25-142.us-east-2.compute.internal.59387 > ip-172-21-0-2.us-east-2.compute.internal.domain: 51851+ AAAA? 5df09v6.v.testdomain.net.us-east-2.compute.internal. (64)
07:28:24.830804 IP ip-172-21-0-2.us-east-2.compute.internal.domain > ip-10-42-25-142.us-east-2.compute.internal.59387: 51851 NXDomain 0/1/0 (125)
07:28:24.835617 IP ip-172-21-0-2.us-east-2.compute.internal.domain > ip-10-42-25-142.us-east-2.compute.internal.59387: 51131 NXDomain 0/1/0 (125)
07:28:24.835755 IP ip-10-42-25-142.us-east-2.compute.internal.59529 > ip-172-21-0-2.us-east-2.compute.internal.domain: 25851+ A? 5df09v6.v.testdomain.net. (37)
07:28:24.835875 IP ip-10-42-25-142.us-east-2.compute.internal.59529 > ip-172-21-0-2.us-east-2.compute.internal.domain: 26301+ AAAA? 5df09v6.v.testdomain.net. (37)
07:28:24.837087 IP ip-172-21-0-2.us-east-2.compute.internal.domain > ip-10-42-25-142.us-east-2.compute.internal.59529: 26301| 2/0/0 CNAME g13v.prd.ads.aws.testdomain.net., CNAME gslb-aws-us-east-02.v.f
wmrm.net. (102)
07:28:24.837094 IP ip-172-21-0-2.us-east-2.compute.internal.domain > ip-10-42-25-142.us-east-2.compute.internal.59529: 25851| 2/0/0 CNAME g13v.prd.ads.aws.testdomain.net., CNAME gslb-aws-us-east-02.v.f
wmrm.net. (102)
但是等等,如果是这样,为什么那个 sidecar 里面没问题呢?DNS 返回的结果应该是公平的吧。。那就抓包 sidecar 的请求,发现见鬼了,人家那边确实总是会返回 ip 的。不过经过仔细观察抓包结果,发现实际上 sidecar 容器的请求也是会收到这个没有任何 A 记录的返回的,但是那边会立刻重新发送一个基于 TCP 协议的 DNS 查询,这个后续的查询能得到全部的 A 记录。
DNS 的 UDP 协议的请求设计上面只有一个来回,就是用一个数据包发送查询过去,然后对方在一个数据包里面返回查询结果。因为传输的时候包大小是有限制的(MTU),为了一次交互完事(如果是多个包要处理估计还会增加复杂度),DNS 的结果不能超过 512 字节。这个设定在早起互联网阶段是可以的,但是现在动不动就是面向全球用户的服务,服务器多了 ip 也就多了,这就可能会超过 512 字节了。这个时候怎么办?
在 DNS 协议的 rfc 里面有这么一段
In the absence of EDNS0 (Extension Mechanisms for DNS 0) (see below), the normal behaviour of any DNS server needing to send a UDP response that would exceed the 512-byte limit is for the server to truncate the response so that it fits within that limit and then set the TC flag in the response header. When the client receives such a response, it takes the TC flag as an indication that it should retry over TCP instead.
RFC 1123 also says:
… it is also clear that some new DNS record types defined in the future will contain information exceeding the 512 byte limit that applies to UDP, and hence will require TCP. Thus, resolvers and name servers should implement TCP services as a backup to UDP today, with the knowledge that they will require the TCP service in the future.
Existing deployments of DNS Security (DNSSEC) [RFC4033] have shown that truncation at the 512-byte boundary is now commonplace. For example, a Non-Existent Domain (NXDOMAIN) (RCODE == 3) response from a DNSSEC-signed zone using NextSECure 3 (NSEC3) [RFC5155] is almost invariably larger than 512 bytes.
Since the original core specifications for DNS were written, the Extension Mechanisms for DNS (EDNS0 [RFC2671]) have been introduced. These extensions can be used to indicate that the client is prepared to receive UDP responses larger than 512 bytes. An EDNS0-compatible server receiving a request from an EDNS0-compatible client may send UDP packets up to that client's announced buffer size without truncation.
However, transport of UDP packets that exceed the size of the path MTU causes IP packet fragmentation, which has been found to be unreliable in some circumstances. Many firewalls routinely block fragmented IP packets, and some do not implement the algorithms necessary to reassemble fragmented packets. Worse still, some network devices deliberately refuse to handle DNS packets containing EDNS0 options. Other issues relating to UDP transport and packet size are discussed in [RFC5625].
The MTU most commonly found in the core of the Internet is around 1500 bytes, and even that limit is routinely exceeded by DNSSEC- signed responses.
The future that was anticipated in RFC 1123 has arrived, and the only standardised UDP-based mechanism that may have resolved the packet size issue has been found inadequate.
所以当 DNS 服务器遇到需要返回的结果超过 512 字节的时候,应该需要在返回结果里面标记 TC 标记位。同时服务器端也应该同时支持 TCP 协议的查询,以及客户端在收到的结果里面发现 TC 标记位之后,应该使用 TCP 协议重新查询。这也正是 sidecar container 里面发生的情况。
回到本文的标题,应用容器里面的 curl 没有 follow 这个 rfc 的原因就是因为这个 container 是基于 alpine 的。这个 DNS 查询的问题在 alpine image 是个常见问题,一搜一大把,就不多说了。
问题 2 破案了,那么问题 1 和 3 是怎么回事?为什么有时候能连上呢? 这个具体原因不清楚,实际观察的结果是 DNS 查询会返回两种情况
- 就是上面抓包的情况,没有任何的 A 记录。
- 会返回一些 A 记录。
因为 DNS 服务器端不由我控制,无法 Debug。观察到的情况就是我们查询的域名会有两批 ip 地址,有一批比较多会超过 512,这个时候就是 1 这个情况。另外一批小于 512 会正常返回 A 记录,就是 2 这个情况。那其实还有问题,既然返回了 A 记录,为什么还同时有问题 1 和 3 这样的情况?
出现这个问题是因为 curl 有时候会尝试连接 ipv6 的地址。上面的抓包记录可以看到,DNS 请求总是同时查询 A 和 AAAA 记录。当 AAAA 先返回的时候,就会出现连接 ipv6 地址的情况。我们的 worker 节点为了省事关闭了 ipv6 支持,所以是联不通的。我试过使用 curl -4
这样的方式也无法让 curl 只连接 ipv4 的地址,我感觉这锅可能还是 alpine 的,但是没有去确认。