Recently I was asked to troubleshoot a performance problem. The developers had written some synthetic load tests and were seeing their tests come grinding to a halt a minute or two into the test.
The test involved sending a simple message and having the server echo it back in the response. Each request opened a brand-new TCP socket.
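A minimal sketch of what a test client like that looks like (the host, port, and payload here are placeholders; the real test used destination port 9100):

```python
import socket

def echo_once(host: str, port: int, payload: bytes) -> bytes:
    """Open a brand-new TCP connection, send one message, read the echo, close."""
    with socket.create_connection((host, port)) as s:
        s.sendall(payload)
        return s.recv(len(payload))

# Each call pays the full connect/close cost and consumes a fresh
# ephemeral source port on the client -- which matters later in this story.
```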
The physical setup was pretty basic, looking something like the following:
|Server|---|TOR Switch|---|AGG Switch|---|FW|---|AGG Switch|---|TOR Switch|---|Server|
The firewall serves as the router for the subnets. All paths are 10 Gbit/s or greater.
The servers are VMware hosts running multiple guests. The guests are all running CentOS Linux 6.5.
First things first: check some utilization metrics. The VMware hosts weren't busy and the guest was only pushing about 40 Mbit/s of traffic through the NICs. Nothing seemed out of place there, and certainly nothing that should be pushing against infrastructure limits.
Next check was the firewall logs. We use Splunk in our environment and feed all of our logs into it. The firewall logs gave me some concern right off the bat: in under a minute the test had opened about 80,000 sessions before the volume trailed off.
My instinct was to investigate that further. Each outbound TCP connection to the same destination has to bind to a unique source port, and a server has at most 65,535 of them to choose from (in practice far fewer: the Linux default ephemeral range is roughly 32768-61000, about 28,000 ports). Right away I became concerned about a port exhaustion issue. The question was, which device was running into limits? Were the sessions closing properly? Was the Linux guest recycling the ports correctly? Were the sessions being closed out of the firewall correctly?
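Some back-of-envelope arithmetic (assuming the stock Linux ephemeral port range, which these CentOS guests would have been using) shows why 80,000 sessions in under a minute is alarming:

```python
# Linux default ephemeral range (net.ipv4.ip_local_port_range) is 32768-61000.
ephemeral_ports = 61000 - 32768      # ~28,000 usable source ports
sessions_opened = 80_000             # observed in the firewall logs

# The client must have cycled through its entire port range almost 3 times,
# meaning source ports were being re-used within seconds of being closed.
print(sessions_opened / ephemeral_ports)   # ~2.83
```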
Next step was to request a capture from both sides. This particular network doesn't have physical taps and we don't have SPAN set up on the dvSwitch yet, so we asked the server admins to run tcpdump on each guest so we could compare the results.
There is always some judgment in filtering your trace files down to something relevant: too much filtering and you can miss the important parts, too little and the noise is too high to sift through. In this instance I was concerned about port exhaustion, so I knew right away what level of filtering to apply. Near the end of the trace I quickly found sessions trying to open but failing (SYN packets being re-sent). That flow was using source port 45760, and all flows were using destination port 9100. I wanted to see every use of source port 45760, how many times it was used and when, so the Wireshark display filter was simply "tcp.port==45760".
This is a screenshot from Wireshark after filtering down on that port. Column 2 is the delta time since the previous packet shown.
As you can see, source port 45760 was used, closed, and then re-used about 19 seconds later. The second connection wasn't successful; it never received a response from the other system.
Luckily we captured at both ends, so we can check whether that second socket ever made it there. The trace file from the destination shows only one flow ever using source port 45760, and the retried SYN packets were never received.
Within a few minutes we knew it wasn't an issue with the destination server, and since the client was able to bind to 45760 again, it didn't appear to be port exhaustion on the client either. That left the firewall in the middle.
Our firewall is a Juniper SRX 3600, and it is important to know what happens to a session inside it. All firewalls maintain a session table, and we can't open two simultaneous sessions using the same 5-tuple (source IP, source port, destination IP, destination port, and protocol). While the behavior varies by protocol (i.e. TCP vs. UDP), with TCP things are pretty straightforward: the client and server keep track of the session state and it is transmitted in the communication flow, so the firewall simply has to monitor the flow and watch for the goodbyes to happen (FINs or RSTs in packet lingo).
In our screenshot above you can see the goodbye messages start at packet 13. Once the first FIN packet is seen, the firewall marks the session as closing. It is too soon to drop the session outright, because we want to give both sides time to finish gracefully, so the session moves to a TIME_WAIT state. If the goodbye messages complete, the session is cleared from the table immediately; otherwise the firewall waits some period and then closes the session. Juniper's default is to wait 150 seconds after seeing the first FIN packet.
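The session-table behavior described above can be modeled in a few lines. This is a toy sketch, not Juniper's implementation; the addresses are made up and only the 150-second default and port numbers come from this story:

```python
TIME_WAIT_TIMEOUT = 150  # Juniper SRX default after the first FIN

class SessionTable:
    """Toy model of a stateful firewall's TCP session handling."""
    def __init__(self):
        self.sessions = {}  # 5-tuple -> (state, expiry deadline or None)

    def open(self, five_tuple, now):
        self._expire(now)
        if five_tuple in self.sessions:
            return False  # same 5-tuple still held: the new SYN goes nowhere
        self.sessions[five_tuple] = ("ESTABLISHED", None)
        return True

    def first_fin(self, five_tuple, now):
        # Start of the goodbye: hold the entry in TIME_WAIT for the timeout.
        self.sessions[five_tuple] = ("TIME_WAIT", now + TIME_WAIT_TIMEOUT)

    def graceful_close(self, five_tuple):
        # Both sides said goodbye cleanly: clear the session immediately.
        self.sessions.pop(five_tuple, None)

    def _expire(self, now):
        self.sessions = {k: v for k, v in self.sessions.items()
                         if v[1] is None or v[1] > now}

flow = ("10.0.0.1", 45760, "10.0.1.1", 9100, "tcp")
table = SessionTable()
table.open(flow, now=0)
table.first_fin(flow, now=1)     # client FINs but never completes cleanly
print(table.open(flow, now=20))  # False: re-use 19s later is refused
print(table.open(flow, now=152)) # True: the 150s timer has expired
```

This is exactly the pattern in the trace: the second SYN on port 45760 arrived about 19 seconds after the close, well inside the 150-second hold.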
The behavior of the client is a bit screwy. We would expect it to send one final ACK, but instead it sends an RST (or two). That doesn't satisfy the firewall's requirements to complete the TIME_WAIT, so the firewall keeps the session in state until the 150-second timer expires. We quickly decided that the application behavior could not be modified, so we began to investigate the firewall timers.
Juniper sets that TIME_WAIT timeout as a global option:
set security flow tcp-session time-wait-state session-timeout <2-600 seconds>
Lowering that session timer to 20 seconds improved performance, 10 seconds left the test nearly unthrottled, and 5 seconds let it run at full performance.
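That tracks with the arithmetic: the ceiling on new connections per second from one client to one destination is roughly the ephemeral port range divided by how long the firewall holds each closed session (again assuming the Linux default range; the exact numbers are illustrative):

```python
ephemeral_ports = 61000 - 32768   # assumed Linux default range, ~28k ports

# Max sustainable new connections/sec before fresh 5-tuples start colliding
# with sessions the firewall is still holding in TIME_WAIT.
for timeout in (150, 20, 10, 5):
    print(timeout, ephemeral_ports // timeout)
# 150s -> ~188/s, 20s -> ~1411/s, 10s -> ~2823/s, 5s -> ~5646/s
```

At the default 150 seconds the test could only sustain a couple hundred connections per second before stalling, which matches how quickly it ground to a halt.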
The only lingering question I have is about the RST. I would have expected the firewall to clear the session immediately on the RST, but I haven't found that documentation from Juniper yet, so it looks like I need to do some lab testing. Does an RST also trigger the same 150-second wait, or was the fact that the session was already in TIME_WAIT a factor?
This was a fun little problem; we got to look at some packets and solve something real, a great day as a network engineer.