Hi everyone, this is Will Aftring again with the Windows Debug team, here to lay the groundwork for a new series on how to get started with network trace analysis.
This is not an introduction to networking. Many of the networking topics discussed in this document will be simplifications designed to get you up and running with trace analysis. There is the assumption that you are at least familiar with concepts like an IP address and ping.
It is also important to keep in mind that most of this document discusses networking in the context of the Windows operating system and your experience may differ with other operating systems.
For deeper understanding I recommend reading the relevant request for comments (RFC) for the topic you are curious about.
Why should you care
Being able to perform basic network trace analysis is about being able to save time and energy when applications or computers start to go awry.
- I can’t download Windows Updates!
- My download is slow!
- I can’t access a website!
- My application isn’t working!
Nearly all modern applications use the network to some degree and knowing how your specific technology interacts with the network will help you understand not only the portion that uses the network but also how your application is designed.
Being able to perform a basic network trace can help direct further analysis and prevent at least a few trips down rabbit holes (I'm looking at you, DNS).
Most importantly, by having a basic understanding of network trace analysis you are expanding your repertoire and Accelerating your IT Career.
Networking Overview
Networking is as simple as sending a letter.
Let's say we want to send a letter to Randy, but he lives two towns over. We know that the letter is going to pass through each town's post office on its way to Randy's house.
To figure out whether the letter arrived, the first step is to check with Randy.
- You: "Hey Randy! Did you get the letter I sent?"
- Randy: "Nope, it never arrived."
- You: "That's weird. Let me investigate it!"
If we know that we personally took the letter to Post Office A, and we know that the letter didn't arrive at Randy's house, where should we look for the letter?
If you guessed Post Office B, you're correct. By looking for the letter at Post Office B we will know the following:
- If the letter arrived at Post Office B, then we know that either Post Office B didn't send it to Randy's house or it got lost on the way to Randy's house.
- If the letter never arrived at Post Office B, then we know that either Post Office A didn't send it or it got lost on the way to Post Office B.
Now let's take the example above and remove our metaphor:
- Letter : Packet
- Your house : The client
- Randy's House : The server
- Post Office A : Your default gateway
- Post Office B : The router closest to the server
Typically, in network terminology, the client is the machine initiating the connection and the server is the machine receiving it.
If we know that our machine sent our packet to our default gateway, and it didn't arrive at the router closest to the server, then we know it got dropped somewhere between the two.
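The same reasoning can be sketched in a few lines of Python. This is only an illustration of the idea, not a real tool: the function name `locate_drop` and the hop names are made up for this example, and it assumes you already know (from traces or logs) which hops saw the packet.

```python
from typing import Optional, Sequence, Tuple


def locate_drop(
    hops: Sequence[str], seen: Sequence[bool]
) -> Optional[Tuple[Optional[str], str]]:
    """Return (last hop that saw the packet, first hop that did not).

    Returns None when every hop saw the packet. The first element is None
    when even the first hop never saw it.
    """
    if len(hops) != len(seen):
        raise ValueError("hops and seen must have the same length")
    previous = None
    for hop, saw_packet in zip(hops, seen):
        if not saw_packet:
            return (previous, hop)
        previous = hop
    return None


# Hypothetical path, named after the letter metaphor above.
path = ["client", "default gateway", "router closest to the server", "server"]
seen = [True, True, False, False]
print(locate_drop(path, seen))
# ('default gateway', 'router closest to the server')
```

The packet was lost either by the first hop in the pair or on the way to the second one, which is exactly where the next trace should be taken.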
Firewalls:
Firewalls are a necessity in our security-conscious world, but they can be difficult to configure properly. By default, if two Windows machines communicate with each other over the network, the traffic traverses at least two firewalls: the sender's Windows Firewall and the receiver's Windows Firewall.
For the sake of the metaphor above let's think of a firewall as a wall with guards put in place by the Postmaster General between Post Office A and Post Office B.
The driver from Post Office A reaches the firewall between the two locations. When the driver gets to the guard, the guard says, "Let me look at your letter. I have a list of houses that you're not allowed to deliver to. If your destination is on that list, I am throwing it away and you can't stop me."
Unfortunately for you, Randy's house is on the do not allow list and the letter is discarded by the firewall.
Focusing on the Windows Firewall: from a technical point of view, here is roughly what our traffic looks like once we add a firewall to the picture:
Most packet capture tools work between the Windows Filtering Platform (WFP) and NDIS. So, if we do not see our traffic in a trace taken on the sending side, then it never made it that far down the network stack. The usual suspects in cases like this are:
- The application didn't actually send the traffic (the letter was never mailed)
- The firewall dropped the packet before it reached the capture point (the destination is on the do not allow list, so the letter is thrown away)
Subnets:
We are only going to touch on subnets briefly because, to reiterate, this is not an introduction to networking.
Subnets are logical groupings of IP addresses. If you are in the same subnet, you are in the same town and don't need to use the Post Office.
More information on subnetting.
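If you want to check the "same town" question quickly, Python's standard `ipaddress` module can do it. This is a small sketch: the helper name `same_subnet` is my own, and the /24 prefix lengths below are assumptions for the example, so use the mask your adapter actually has.

```python
import ipaddress


def same_subnet(local_ip: str, prefix_len: int, remote_ip: str) -> bool:
    """True if remote_ip falls inside the subnet that local_ip/prefix_len belongs to."""
    network = ipaddress.ip_interface(f"{local_ip}/{prefix_len}").network
    return ipaddress.ip_address(remote_ip) in network


# Same town: no post office (default gateway) needed. The /24 is assumed.
print(same_subnet("192.168.2.10", 24, "192.168.2.100"))   # True

# Different town: the traffic has to go through the default gateway.
print(same_subnet("10.191.98.95", 24, "151.101.193.140"))  # False
```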
Protocols
Communication between devices is done with different protocols, each with its own uses and behaviors. (These behaviors are outlined in each protocol's RFC.)
There are thousands upon thousands of protocols, and the application you care about likely has its own protocol that is not covered here. Understanding the principles at play, however, will help you figure out what is happening.
The three that we will be covering in this post are Internet Control Message Protocol (ICMP), User Datagram Protocol (UDP), and Transmission Control Protocol (TCP). I have chosen these three since most of our specific application protocols either sit on top of these protocols or are a close enough comparison.
ICMP
Most people know of ICMP through the ping command. Ping creates an ICMP Echo Request and sends it to the selected destination. If the destination receives that request, it replies with an ICMP Echo Reply.
Here I am pinging contoso.com :
C:\>ping contoso.com
Pinging contoso.com [192.168.2.100] with 32 bytes of data:
Reply from 192.168.2.100: bytes=32 time=1ms TTL=128
Reply from 192.168.2.100: bytes=32 time=5ms TTL=128
Reply from 192.168.2.100: bytes=32 time<1ms TTL=128
Reply from 192.168.2.100: bytes=32 time<1ms TTL=128
Ping statistics for 192.168.2.100:
Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
Minimum = 0ms, Maximum = 5ms, Average = 1ms
Pretty straightforward. Here is what that traffic looks like in a network trace:
- I have sent out an echo request to 192.168.2.100 from my machine 192.168.2.10
1 0.000000 192.168.2.10 192.168.2.100 ICMP 74 Echo (ping) request id=0x0001, seq=5/1280, ttl=128 (reply in 2)
- We know that something received the packet as we get an ICMP reply:
2 0.000864 192.168.2.100 192.168.2.10 ICMP 74 Echo (ping) reply id=0x0001, seq=5/1280, ttl=128 (request in 1)
As you can see in the ping output and the screenshot of Wireshark, we send 4 requests and get 4 replies.
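When a trace has more than a handful of pings in it, pairing requests with replies by eye gets tedious. Here is a minimal sketch of how that pairing works, keyed on the id and seq fields shown in the trace lines above. The `IcmpEcho` record and `unanswered_requests` function are names I made up for this example, and frame 3 is a hypothetical packet added to show what an unanswered request looks like.

```python
from typing import Dict, List, NamedTuple, Tuple


class IcmpEcho(NamedTuple):
    frame: int
    src: str
    dst: str
    kind: str   # "request" or "reply"
    ident: int  # the id= field in the trace
    seq: int    # the seq= field in the trace


def unanswered_requests(packets: List[IcmpEcho]) -> List[IcmpEcho]:
    """Return echo requests that never got a matching reply."""
    pending: Dict[Tuple[str, str, int, int], IcmpEcho] = {}
    for p in packets:
        if p.kind == "request":
            pending[(p.src, p.dst, p.ident, p.seq)] = p
        elif p.kind == "reply":
            # A reply travels in the opposite direction with the same id and seq.
            pending.pop((p.dst, p.src, p.ident, p.seq), None)
    return list(pending.values())


packets = [
    IcmpEcho(1, "192.168.2.10", "192.168.2.100", "request", 0x0001, 5),
    IcmpEcho(2, "192.168.2.100", "192.168.2.10", "reply", 0x0001, 5),
    # Hypothetical frame, not from the trace above: a request with no reply.
    IcmpEcho(3, "192.168.2.10", "192.168.2.100", "request", 0x0001, 6),
]

for lost in unanswered_requests(packets):
    print(f"frame {lost.frame}: no reply for seq={lost.seq}")
# frame 3: no reply for seq=6
```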
Thinking back to our mailing-a-letter metaphor, it is the equivalent of sending Randy a letter saying "Hey" and Randy responding with "Hey".
We still need to give our letter to Post Office A. Post Office A still passes the letter to Post Office B, which passes it to Randy, and Randy's response needs to make it all the way back.
UDP
Applications tend to use UDP when they are time sensitive: UDP has minimal overhead, and data can be pushed out more quickly because the sender does not wait for acknowledgments or retransmissions. Because of this, with UDP we are more interested in packet trends, as those are more indicative of a problem than any single lost packet.
A good example of this is heartbeat traffic in Windows failover clusters. The cluster doesn't care whether the server it is talking to gets any single packet. It is more concerned with missed groups of packets, because that is when the service concludes that the machine it is talking to is offline.
Here is an example of cluster heartbeat traffic in a network trace:
- Our client 172.16.4.105 is sending out a heartbeat packet to 172.16.4.123
181 13:18:37.4618346 0.9834003 172.16.4.105 172.16.4.123 RCP RCP: RCP_REQUEST(0) Sequence= 539958 (0x83D36) {UDP:29, IPv4:28, NDISPacCap_MicrosoftWindowsNDISPacketCapture:27, NetEvent:26}
- Because that machine is online, we get a response.
182 13:18:37.4620318 0.9835975 172.16.4.123 172.16.4.105 RCP RCP: RCP_RESPONSE(0X1) Sequence= 539958 (0x83D36) {UDP:29, IPv4:28, NDISPacCap_MicrosoftWindowsNDISPacketCapture:27, NetEvent:26}
- Later in the trace, we can see that our client is no longer getting any response from the other node:
250 13:18:38.4621245 1.9836902 172.16.4.105 172.16.4.123 RCP RCP: RCP_REQUEST(0) Sequence= 539959 (0x83D37) {UDP:29, IPv4:28, NDISPacCap_MicrosoftWindowsNDISPacketCapture:27, NetEvent:26}
303 13:18:39.4616752 2.9832409 172.16.4.105 172.16.4.123 RCP RCP: RCP_REQUEST(0) Sequence= 539960 (0x83D38) {UDP:29, IPv4:28, NDISPacCap_MicrosoftWindowsNDISPacketCapture:27, NetEvent:26}
417 13:18:41.4619259 4.9834916 172.16.4.105 172.16.4.123 RCP RCP: RCP_REQUEST(0) Sequence= 539962 (0x83D3A) {UDP:29, IPv4:28, NDISPacCap_MicrosoftWindowsNDISPacketCapture:27, NetEvent:26}
Notice that the sequence numbers continue to increase; the client does not try to resend the same data as before. Once again, this is because we care more about groups of packets being lost. The data in any single packet is less important than the stream of data being intact.
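Since the trend matters more than any one packet, a useful question to ask of a UDP trace is "how many requests in a row went unanswered?" Here is a rough sketch of that check using the sequence numbers from the excerpt above. The function name `longest_missed_run` is my own, and this says nothing about the thresholds the cluster service itself uses.

```python
from typing import Iterable


def longest_missed_run(requests: Iterable[int], responses: Iterable[int]) -> int:
    """Length of the longest streak of captured requests that got no response."""
    answered = set(responses)
    longest = current = 0
    for seq in sorted(requests):
        if seq in answered:
            current = 0
        else:
            current += 1
            longest = max(longest, current)
    return longest


# Sequence numbers taken from the trace excerpt above.
# 539961 is not in the excerpt, so it is not counted here.
request_seqs = [539958, 539959, 539960, 539962]
response_seqs = [539958]

print(longest_missed_run(request_seqs, response_seqs))  # 3
```

One missed heartbeat is noise. A streak like this one is the trend worth chasing.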
Here is a screenshot of a cluster heartbeat network trace in Network Monitor:
TCP
TCP is the protocol for reliable data transmission. It requires additional overhead, but along with that overhead comes the knowledge of whether the data you have sent has been seen and acknowledged.
Because of this, if we see that a TCP packet goes unacknowledged, we will typically see a TCP retransmit, where we send the same data again and hope we get a response. If the client continues to see that the data it sends goes unacknowledged, it will eventually terminate the connection.
Regardless of what application is initiating the TCP traffic, the first thing we will see in a new connection is the TCP 3-Way Handshake. This allows us to communicate between two machines over the specified port. If the TCP handshake fails, then our application can do nothing over the network.
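You can watch a handshake succeed or fail without any special tooling. The sketch below uses Python's standard `socket` module against a throwaway listener on the local machine, so it does not depend on any remote server. The helper name `probe_tcp` and the result strings are my own choices for this example.

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Attempt a TCP connection and describe how the handshake went."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "handshake completed"
    except ConnectionRefusedError:
        # The other side answered our SYN with a reset.
        return "refused"
    except socket.timeout:
        # Nothing came back at all before we gave up.
        return "timed out"
    except OSError as exc:
        return f"failed: {exc}"


# Start a listener on a free local port, then probe it.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
port = listener.getsockname()[1]

open_result = probe_tcp("127.0.0.1", port)
print("with a listener:", open_result)

# Close the listener and probe the same port again.
listener.close()
closed_result = probe_tcp("127.0.0.1", port)
print("without a listener:", closed_result)
```

With the listener in place the handshake completes. Once it is gone, the second probe will typically come back as refused, while a timeout usually points at something silently dropping the traffic along the way.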
Here is an example of a TCP handshake to bing.com over port 80:
- Sending out our TCP SYN
3516 64240 14:08:51.463484 10.191.98.95 151.101.193.140 TCP 66 51169 80 [SYN] Seq=0 Win=64240 Len=0 MSS=1460 WS=256 SACK_PERM=1
- Our server then responds with the TCP SYN, ACK
3527 29200 14:08:51.510126 151.101.193.140 10.191.98.95 TCP 66 80 51169 [SYN, ACK] Seq=0 Ack=1 Win=29200 Len=0 MSS=1460 SACK_PERM=1 WS=512
- Our client sends out a TCP ACK to complete the handshake
3528 1026 14:08:51.510267 10.191.98.95 151.101.193.140 TCP 54 51169 80 [ACK] Seq=1 Ack=1 Win=262656 Len=0
- We can continue doing whatever our application reached out to the server to do. In this case, make an HTTP request.
3530 1026 14:08:51.511178 10.191.98.95 151.101.193.140 HTTP 208 GET / HTTP/1.1
Screenshot from Wireshark:
TCP Red Herrings
TCP is a more descriptive protocol when it comes to telling us what it is doing. There are flags, windows, and payloads, but it is important to understand a few red herrings that are common with TCP:
- The TCP Reset:
This is not necessarily an issue. A TCP Reset doesn't mean that the sky is falling, that the world is coming to an end, or that the network is up in flames. When we send a TCP Reset, all it means is that we are closing our connection instantly instead of proceeding with a graceful closure. Not only are we done talking, but we also want to free up the port we were using for this connection so that we can use it again!
The ONLY time I would interpret a TCP Reset as an issue is if we receive a TCP RESET or TCP ACK RESET in response to a TCP SYN packet that we had just sent out. This would mean that the port we are trying to communicate with outright refused the connection.
Typically, this is an indication of a lack of a listener, but I’ll go more in depth on listeners in a future blog post.
- The TCP Zero Window:
This is one of my favorite red herrings because it looks like a network issue, with TCP right there in the name, but it usually isn't one.
To really understand the TCP Zero Window, we need to understand what a TCP Window is.
Let’s say we have a conveyor belt. Our client is responsible for putting boxes on the conveyor belt, and the application that we are talking to pulls the boxes off the conveyor belt.
And let’s say our application can pull one box off the belt every 20 seconds and the maximum number of boxes that can fit on the conveyor belt at once is 3.
So here our client is putting a box on the conveyor belt at a rate of 1 box every 20 seconds.
Window Size is the number of unoccupied box slots on the conveyor belt.
Now let’s say our Receiver Application is busy doing some other work and now it can only pull boxes at a rate of 1 box every 80 seconds.
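Our Window starts to look like this:
- After 20 seconds: 1 box on the belt, 2 open slots.
- After 20 more seconds: 2 boxes on the belt, 1 open slot.
- And after 20 MORE seconds: 3 boxes on the belt, 0 open slots.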
OH NO! We have a Zero Window! What do we do from here?
It is simple. We wait.
We will wait, and our client will ask "Hey, can you receive more data yet?"
And the Receiver will either say “Yes, send away.” or “No, I need more time to process this data.”
This process will continue until the application is able to start pulling data out of the buffer or the client decides this is taking too long and closes the connection.
In these situations, it is important to note that the application, NOT the operating system, is responsible for pulling this data out of the buffer.
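The conveyor belt is easy to play with in code. Here is a toy simulation of it, assuming the belt starts empty and using the numbers from the example above (3 slots, one box sent every 20 seconds, one box pulled every 80 seconds). The function name `simulate_window` is made up for this sketch, and keep in mind that a real TCP window is measured in bytes, not boxes.

```python
from typing import List, Tuple


def simulate_window(
    capacity: int, send_interval: int, receive_interval: int, duration: int
) -> List[Tuple[int, int, int]]:
    """Return (time, boxes on belt, window) at every send tick.

    The belt starts empty. The receiver pulls one box every receive_interval
    seconds, and the sender adds one box every send_interval seconds as long
    as there is a free slot.
    """
    on_belt = 0
    timeline = []
    for t in range(1, duration + 1):
        if t % receive_interval == 0 and on_belt > 0:
            on_belt -= 1  # the application pulls a box off the belt
        if t % send_interval == 0:
            if on_belt < capacity:
                on_belt += 1  # the client puts a box on the belt
            timeline.append((t, on_belt, capacity - on_belt))
    return timeline


for t, boxes, window in simulate_window(3, 20, 80, 60):
    print(f"t={t}s boxes={boxes} window={window}")
# t=20s boxes=1 window=2
# t=40s boxes=2 window=1
# t=60s boxes=3 window=0
```

Try changing the receive interval back to 20 seconds and the window never reaches zero, which is the whole point: the fix lives with the application pulling the data, not with the network.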
Now that we have covered the basics, the next post will go into:
- Understanding network trace tooling
- Getting started with collecting network traces
- General information you would want along with traces
Posted at https://sl.advdat.com/3ykit28