Wednesday, 20 April 2016

How Large Send Offload (TCP Chimney) on NICs Causes Network Issues on Servers





I would like to share another experience with Exchange server connectivity issues.

We started receiving unexpected monitoring alerts stating that databases were failing over from the active to the passive nodes, along with user incidents for Outlook connectivity problems that were intermittent in nature. During the course of our investigation we noticed that the servers were losing network connectivity and dropping ping requests. We concluded that these network drops were causing the database failovers, which in turn led to the client connectivity issues. The problem was sporadic, and it was reported from multiple sites on multiple servers. We could not find any common symptom across the incidents, other than the fact that they occurred during peak business hours.
The issue was escalated to multiple teams, including networking, AD and security (Symantec Endpoint Protection), and each confirmed there was no issue on their side, so those areas were ruled out.
That left the platform (Windows Server 2008 R2) and the application (Exchange Server 2010). To narrow the issue down, we collected Outlook logs from the client side and analyzed them.



In the above log we could see that the client is waiting for a response from the server, does not receive one, and Outlook then closes the connection.
The network team (load balancer and switch side) had confirmed there was no issue on the network end, and the Outlook logs above pointed at the keep-alive flag and an RPC timeout. We therefore raised a change to set the values recommended by Microsoft. At that point the RPC connection timeout was not set, which means the connection would fall back to the idle timeout configured in IIS.
Minimum Connection Timeout. Configure the RPC timeout on Exchange servers so that components which use RPC trigger a keep-alive signal within the time frame you specify here. This helps keep network devices in front of Exchange from closing sessions prematurely:
HKLM\Software\Policies\Microsoft\Windows NT\RPC\MinimumConnectionTimeout
DWORD  0x00000078 (120 decimal)
Currently this is not set explicitly
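For reference, the same change can be made from an elevated PowerShell (or CMD) prompt with the built-in reg.exe. This is only a sketch of the registry write described above; apply it through your normal change process:

# Set the RPC minimum connection timeout to 120 seconds (0x78), as per the value above
reg add "HKLM\Software\Policies\Microsoft\Windows NT\RPC" /v MinimumConnectionTimeout /t REG_DWORD /d 120 /f
# Confirm the value was written
reg query "HKLM\Software\Policies\Microsoft\Windows NT\RPC" /v MinimumConnectionTimeout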

Set Keep-alive timeout. This determines how often TCP sends keep-alive transmissions, which TCP uses to verify that an idle connection is still active. Many network devices such as load balancers and firewalls use an aggressive 5-minute idle session timeout; this setting helps keep those devices from closing a session prematurely:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\KeepAliveTime
DWORD value = 900000
Currently this is set to 300000
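Again as a sketch only, this value can be changed with reg.exe from an elevated prompt; Tcpip parameters normally need a reboot to take effect, so plan the change window accordingly:

# Change KeepAliveTime from the current 300000 ms (5 minutes) to 900000 ms (15 minutes)
reg add "HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters" /v KeepAliveTime /t REG_DWORD /d 900000 /f
# Confirm the value was written
reg query "HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters" /v KeepAliveTime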

But even after amending the above settings we continued to get the intermittent connectivity errors, and we saw network errors in another set of Outlook logs.



We then analyzed the network traces collected from the server and found the following:

1.       In the network traces we saw TCP payload lengths larger than the standard 1460 bytes for TCP.
a.       This is usually indicative of TCP offloading being enabled on the servers (a quick way to check the offload state is shown after this list).
b.       TCP offloading has been known to overload NICs and create network outages on servers. (TCP payload length column pasted below.)



2.       We also noticed that the heartbeat traffic between the cluster servers stopped being sent.


3.       We also noticed from the network traces that packets were not making it to the NDIS layer.
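As a read-only check, and only as a sketch rather than the full diagnosis, the global TCP offload state on a suspect server can be viewed from an elevated prompt with the built-in netsh:

# Show the global TCP parameters; the "Chimney Offload State" line shows whether TCP Chimney is enabled
netsh int tcp show global
# The per-NIC Large Send Offload setting on Windows Server 2008 R2 has to be checked in the NIC
# driver's Advanced properties or in the teaming software; on Server 2012 or later the
# Get-NetAdapterLso cmdlet could be used instead, but it is not available on 2008 R2.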


The absence of packets in the server-side trace means that the packets are not reaching the NDIS layer on the server, which is where NETSH sees the packets.
On a Windows server the NDIS layer sits toward the bottom of the TCP/IP stack: above the 3rd-party filter drivers, network card, network card driver and teaming software, but below the application protocols, Winsock API and the transport protocols.

In scenarios where we do not see the packets at the NDIS layer, we need to figure out whether the packets are actually reaching the server, and we need to focus our attention on the components below the NDIS layer on the server and on the physical hops before the server.

During the troubleshooting we also collected data at the switch port where the server was plugged in, so that we could see whether the packets were reaching the port. This allowed us to narrow the problem down further: if we can see the packets at the port but not at NDIS, then something between the physical port and NDIS is causing the issue; conversely, if we do not see the packets at the switch port, then something off-box is preventing the packets from reaching the server.
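The server-side captures were taken with the built-in NETSH tracing; a minimal sketch is below (the trace file path and size limit are placeholders). The matching switch-port capture is taken on the network side, for example via a mirror/SPAN port, which is outside this sketch:

# Start a packet capture on the server; packets are recorded as they are seen at the NDIS layer
netsh trace start capture=yes tracefile=C:\temp\server-side.etl maxsize=512
# ... reproduce the connectivity drop, then stop the trace and collect the .etl file
netsh trace stop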

Action performed
-----------------------------------------------------------
1.       Disable Large Send Offload (TCP Chimney) on the NICs in the problem servers (a command-line sketch is shown after this list).
a.       This step is done in the teaming software or on the properties of the physical NIC driver.
2.       Update the network card drivers to the latest available for the server.
3.       Remove all 3rd-party filter drivers from the server for testing.
a.       This step can be done for a limited period, after which the AV software is reinstalled as required by the business.
4.       Break the NIC team on the server and use a single-NIC configuration for testing.
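As a sketch of step 1 above: the global TCP Chimney state can be switched off with netsh, while the Large Send Offload property itself is per adapter and, on Windows Server 2008 R2, is changed in the NIC driver's Advanced properties or in the teaming software (the Disable-NetAdapterLso cmdlet only exists on Server 2012 and later):

# Disable TCP Chimney offload globally
netsh int tcp set global chimney=disabled
# Verify the new state ("Chimney Offload State" should now read "disabled")
netsh int tcp show global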

Note: TCP Chimney (sometimes referred to as TCP Offloading)

This feature is designed to move network processing tasks, such as packet segmentation and reassembly, from the computer's CPU to a network adapter that supports TCP Chimney Offload. This reduces the workload on the host CPU by moving it to the NIC, allowing the host OS to perform faster and speeding up the processing of network traffic. For example, with Large Send Offload the stack can hand the NIC a buffer much larger than a single segment and let the NIC cut it into wire-sized frames, which is why the server-side traces above showed TCP payloads larger than 1460 bytes.
http://blogs.technet.com/b/onthewire/archive/2014/01/21/tcp-offloading-chimney-amp-rss-what-is-it-and-should-i-disable-it.aspx

The issue was resolved by disabling TCP Chimney.


Reference article on TCP Chimney (Large Send Offload): http://blogs.technet.com/b/onthewire/archive/2014/01/21/tcp-offloading-chimney-amp-rss-what-is-it-and-should-i-disable-it.aspx









