Tuesday, 26 April 2016

Some common methods for diagnosing Exchange Server performance issues and collecting traces and logs.




Here I am explaining some of the common methods for diagnosing performance issues and the traces and logs to collect for each.

ExPerfWiz:

ExPerfWiz is a PowerShell-based script to help automate the collection of performance data on Exchange 2007, 2010 and 2013 servers. Supported operating systems are Windows 2003, 2008, 2008 R2, 2012 and 2012 R2.

The default behavior of the script is to create a rolling BLG file that rolls to a new log when the maximum size of the log has been reached, up to a maximum of 8 hours. For Windows 2008 servers this is based on time instead, as the -max parameter for logman.exe stops the data collection when the maximum log file size has been reached. There is logic in the script to prevent you from changing the maximum size of the BLG files on Windows 2008 servers.

.\experfwiz.ps1 -threads -duration 24:00:00 -interval 5 -filepath <location>
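Under the hood the script drives logman.exe (mentioned above). As a rough manual equivalent, and only a sketch (the counter paths and sizes here are illustrative, not the exact set ExPerfWiz configures), the collection could be set up like this:

logman create counter ExchangePerf -f bincirc -max 512 -si 5 -o C:\PerfLogs\ExchangePerf.blg -c "\Processor(*)\*" "\Memory\*" "\MSExchangeIS\*"
logman start ExchangePerf
logman stop ExchangePerf

-f bincirc gives the circular binary log, -max the size in MB, and -si the sample interval in seconds, matching the -interval 5 used above.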



Netstat:

Netstat is a tool we can use to track down which process identifier (PID) has a port open; this is quite easy when netstat is run with the -a -n -o combination of parameters.

Eg: netstat -ano

We can create a scheduled task to run the netstat tool periodically.

Create a scheduled task - run task.bat every 5 minutes whether a user is logged on or not.
In task.bat, put these 3 lines (%date:~0,2% takes the first two characters of the date, so a new file is started each day; note this depends on the system's date format):
set day=%date:~0,2%
echo %time% %date% >> %day%.txt
netstat -ano >> %day%.txt
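The scheduled task itself can be created from the command line; as a sketch (the task name and script path are illustrative):

schtasks /create /tn "NetstatCapture" /tr "C:\Scripts\task.bat" /sc minute /mo 5 /ru SYSTEM

Running it as SYSTEM lets it run whether a user is logged on or not.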


NETSH:-

Netsh can be used to run a network trace and capture the network traffic.

Steps to collect a netsh trace from the server:

1. Open a command prompt as administrator on the server and the client machine.

2. Issue the following command to start the network trace on the server and the client machine:

a. netsh trace start scenario=netconnection fileMode=circular maxsize=2048 tracefile=c:\traceinfo.etl capture=yes

Once we have reproduced the issue, we can stop the trace:
netsh trace stop

To capture traffic for a particular address only, for example:

netsh trace start capture=yes Ethernet.Type=IPv4 IPv4.Address=192.168.1.1
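The resulting .etl file can be opened in Microsoft Message Analyzer or Network Monitor, or converted to a readable text report on the box itself (using the trace file path from above):

netsh trace convert input=c:\traceinfo.etl output=c:\traceinfo.txt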


Procdump:

ProcDump is a command-line utility whose primary purpose is monitoring an application for CPU spikes and generating crash dumps during a spike that an administrator or developer can use to determine the cause of the spike. ProcDump also includes hung window monitoring (using the same definition of a window hang that Windows and Task Manager use), unhandled exception monitoring and can generate dumps based on the values of system performance counters. It also can serve as a general process dump utility that you can embed in other scripts.

Eg: Procdump -accepteula -mp -s 30 -n 3 store.exe c:\file_name.dmp
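Since ProcDump's main purpose is catching CPU spikes, the -c (CPU threshold) switch is often combined with -s and -n; a hedged illustration against the same process (the dump path is illustrative):

Procdump -accepteula -mp -c 90 -s 10 -n 3 store.exe c:\dumps\store_highcpu.dmp

This writes up to three MiniPlus dumps of store.exe, each taken when CPU usage stays above 90% for 10 consecutive seconds.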



RCA logs collection:-

These log files have connection information for the various clients.

Location:-

%ExchangeInstallDir%\Logging\RPC Client Access
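As a quick sketch for reading these files with PowerShell (the logs are CSV-style with #-prefixed header lines, and the column names come from the Fields: header inside each log; this assumes the ExchangeInstallPath environment variable is present, as it is on Exchange servers):

$dir = Join-Path $env:ExchangeInstallPath "Logging\RPC Client Access"
$log = Get-ChildItem $dir -Filter *.log | Sort-Object LastWriteTime | Select-Object -Last 1
Get-Content $log.FullName | Where-Object { $_ -notmatch '^#' } | Select-Object -First 10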





Memory Dump:-

A memory dump primarily identifies a problem or error within the operating system or any installed application on the system. Typically, a memory dump provides information about the last state of the programs, applications and system before they were terminated or crashed. This information consists of memory locations, program counters, program state and other related details. It is displayed on-screen and also creates a system log file for viewing/referencing later. After a memory dump, the computer is generally unavailable or inaccessible until it's rebooted. A memory dump can also be caused by a memory leak, when the system is out of memory and can no longer continue its operations.

Please check here to learn how to generate a memory dump manually.


How to read memory dump:-
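A minimal sketch with WinDbg and the public Microsoft symbol server (the dump path is illustrative):

windbg -z C:\Windows\MEMORY.DMP

Then, inside the debugger:

.symfix
.reload
!analyze -v

!analyze -v prints the stop code, the faulting module and a probable-cause stack, which is usually enough to decide which component to chase further.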





Sunday, 24 April 2016

Exchange 2013 migration: Kerberos authentication with ASA and SPN





I would like to share an interesting experience with Kerberos and ASA accounts during our Exchange 2013 migration. We are in the process of migrating from Exchange 2010 to Exchange 2013. We have three regions: APAC, EMEA and US. In these three regions we have Exchange gateway servers to accept external connections and to send/receive external email.

Our migration plan is to migrate each site one by one, picking servers from each site. The issue started once we had completed the migration of the APAC region and were in the EMEA migration phase, just after we rolled out Kerberos authentication and the ASA account in the APAC site.

Let me explain our environment. Outlook uses SCP records in AD to discover the Autodiscover endpoint, but Skype and similar clients use an SRV record configured with, e.g., autodiscover.abc.com as the host for the Autodiscover service. As a result, clients talk to autodiscover.abc.com, which resolves to mail.abc.com, which in turn resolves to the respective LBs: APAC (APAC.abc.com), EMEA (EMEA.abc.com) and US (us.abc.com).

As we had already migrated APAC and EMEA to Exchange 2013, and in APAC we had flipped the SPN record to the Exchange 2013 ASA (newly created for Exchange 2013), us.abc.com was sitting on a different ASA (the existing one for Exchange 2010) than autodiscover.abc.com. This breaks Kerberos authentication, as a ticket encrypted for one ASA (Exchange 2010) can't be decrypted by another (Exchange 2013).

This was causing a lot of issues; for example, Skype users were unable to see presence information and were getting authentication prompts.

Unfortunately, we discovered that there is absolutely no way to have two Exchange versions active behind a load balancer without causing authentication prompts, user access issues and a variety of other impacts to connected systems. These topics have caused a lot of noise.


Some background:

How the Skype client gets to calendar info:

1. userA@EMEA.abc.com looks up the presence of userB@EMEA.abc.com

2. The Skype client looks for an SRV record to retrieve the endpoint for EMEA.abc.com

3. It finds the record _autodiscover._tcp.emea.abc.com, which points to autodiscover.abc.com

4. autodiscover.abc.com resolves to EMEA.abc.com

5. EMEA.abc.com resolves to the VIP

6. The client connects to Exchange, authenticates via Kerberos and retrieves the necessary info
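The lookup chain above can be verified from a client with nslookup (using this environment's names):

nslookup -type=srv _autodiscover._tcp.emea.abc.com
nslookup autodiscover.abc.com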



What's happening now:

1. userA@emea.abc.com looks up the presence of userB@emea.abc.com

2. The Skype client looks for an SRV record to retrieve the endpoint for emea.abc.com

3. It finds the record _autodiscover._tcp.emea.abc.com, which points to autodiscover.abc.com

4. autodiscover.abc.com resolves to emea.abc.com

5. emea.abc.com resolves to the VIP

6. The LB routes the session to one of the Exchange servers:
o   If the session is routed to any of the Exchange 2013 servers, the client looks for autodiscover.abc.com and fails, since autodiscover.abc.com is assigned to the Exchange 2010 ASA
o   If the session is routed to an Exchange 2010 server, which still has the autodiscover.abc.com entry, it succeeds

The workaround we implemented here was, in APAC, to point the SRV records in DNS to the APAC LB namespace, which is already configured for Kerberos authentication. This bypasses autodiscover.abc.com (which is assigned to the Exchange 2010 ASA), so only one ASA is in play; Skype clients are able to retrieve Autodiscover information and talk to Exchange. We implemented the same workaround in EMEA as we flipped the SPNs for autodiscover.abc.com (currently on the Exchange 2010 ASA) and the EMEA LB namespace to the Exchange 2013 ASA, after which the SRV records were updated for the SMTP domains.

The solution lies in upgrading the remaining servers to Exchange 2013, flipping the US LB namespace (the remaining URL) to the Exchange 2013 ASA, and configuring the servers in the US to use the 2013 ASA credentials. Then all region-related URLs will be connected to the same ASA, and the Kerberos ticket won't be broken when one URL is redirected to another. After this we will need to restore the SRV record to point back to autodiscover.abc.com.


Some information on Kerberos, NTLM, SPN, ASA etc.


When Exchange 2010 SP1 launched, its main focus was the feature that made it possible for MAPI clients (usually internal Outlook clients) connecting to a load-balanced CAS array to authenticate with Exchange using Kerberos. Previous versions of Exchange supported Kerberos authentication because MAPI clients connected to the mailbox server FQDN, not an FQDN pointing at a load balancer in front of a CAS array. With Exchange 2010 RTM, there was no way for MAPI clients to authenticate using Kerberos.


ASA (Alternate Service Account): this account is used when configuring Kerberos authentication on CAS servers.
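The ASA credential is deployed to CAS servers with the RollAlternateServiceAccountPassword.ps1 script that ships in the Exchange Scripts folder; a sketch (the server and account names are illustrative):

.\RollAlternateServiceAccountPassword.ps1 -ToSpecificServers cas01.abc.com -GenerateNewPasswordFor abc\EX2013ASA$ -Verbose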



An SPN is a unique identifier for a service on a network that uses Kerberos authentication. It consists of a service class, a host name, and a port. On a network that uses Kerberos authentication, an SPN for the server must be registered under either a built-in computer account (such as NetworkService or LocalSystem) or a user account. SPNs are registered for built-in accounts automatically. However, when you run a service under a domain user account, you must manually register the SPN for the account you want to use.
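As an illustration with this environment's namespaces (the ASA account name is hypothetical), the HTTP SPNs would be registered against the ASA account with setspn:

setspn -S http/mail.abc.com abc\EX2013ASA$
setspn -S http/autodiscover.abc.com abc\EX2013ASA$
setspn -L abc\EX2013ASA$

-S checks for duplicates before adding, and -L lists what is currently registered on the account.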

Difference between NTLM and Kerberos authentication:

As the number of connections increases, there is potential for a bottleneck in terms of handling NTLM authentication.

Kerberos authentication, on the other hand, is more efficient and faster: NTLM must connect to a domain controller to authenticate the client, whereas with the Kerberos protocol the server is not required to go to a domain controller. The server can authenticate the client by examining the credentials provided by the client. Clients can obtain the credentials for a particular server once and then reuse those credentials throughout a network logon session.




Wednesday, 20 April 2016

How Large Send Offload (TCP Chimney) on NICs can cause network issues on servers





I would like to share another experience with Exchange server connectivity issues.

We noticed that we were getting unexpected monitoring alerts stating that databases were failing over from active to passive nodes, along with user incidents related to Outlook connectivity; the problem was intermittent in nature. During the course of our investigation we noticed that the server lost its connection and dropped ping requests. We concluded that the network drop caused the server to lose its connection, which caused the database failover and led to the client connectivity issues. The issue was sporadic and was reported from multiple sites on multiple servers. We couldn't find any common symptoms, but we noticed that it occurred during peak business hours.
The issue was escalated to multiple teams, such as networking, AD and security (Symantec Endpoint Protection); they confirmed no issues from their end, and those areas were ruled out.
That left the platform (Windows Server 2008 R2) and the application (Exchange Server 2010). To narrow the issue down, we collected Outlook logs from the client side and analyzed them.



In the above log we can see that the client is waiting on the server, does not receive a response, and then Outlook closes the connection.
As the network team (load balancer and switch side) confirmed there was no issue from the network end, and the above Outlook logs show the keep-alive flag and an RPC timeout, we raised a change to alter the values per Microsoft's guidance. Currently the connection timeout for RPC is not set, which means it takes the idle timeout set in IIS.
Minimum Connection Timeout. Configure the RPC timeout on Exchange servers to make sure that components which use RPC will trigger a keep-alive signal within the time frame you specify here. This will help keep network devices in front of Exchange from closing sessions prematurely:
HKLM\Software\Policies\Microsoft\Windows NT\RPC\MinimumConnectionTimeout
DWORD  0x00000078 (120 decimal)
Currently this is not set explicitly

Set the keep-alive timeout. This determines how often TCP sends keep-alive transmissions; TCP sends them to verify that an idle connection is still active. Many network devices such as load balancers and firewalls use an aggressive 5-minute idle session timeout. This will help keep those devices from closing a session prematurely:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\KeepAliveTime
DWORD value = 900000
Currently this is set to 300000
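As a sketch, both values can be applied from an elevated command prompt (a reboot is needed for the TCP/IP parameter to take effect):

reg add "HKLM\Software\Policies\Microsoft\Windows NT\RPC" /v MinimumConnectionTimeout /t REG_DWORD /d 120 /f
reg add "HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters" /v KeepAliveTime /t REG_DWORD /d 900000 /f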

But after amending the above settings we still got the intermittent connectivity errors. We also saw network errors in another set of Outlook logs.



We analyzed the network trace collected from the server and found that:

1.       In the network traces we saw that TCP payload lengths were larger than the standard 1460 bytes for TCP.
a.       This is usually indicative of TCP offloading being enabled on the servers.
b.       TCP offloading has been known to cause overloading of NICs and create network outages on servers. (TCP payload length column shown below.)



2. We also noticed that heartbeat traffic stopped being sent between the cluster servers.


3. We also noticed that the network traces have packets that are not making it to the NDIS layer.


The absence of packets in the server-side trace means that the packets are not reaching the NDIS layer on the server, which is where netsh sees the packets.
On a Windows server the NDIS layer sits toward the bottom of the TCP/IP stack: above the 3rd-party filter drivers, network card, network card driver and teaming software, but below the application protocols, Winsock API and other transport protocols.

In the scenarios where we do not see the packets at the NDIS layer, we need to try to figure out whether the packets are actually reaching the server, and we need to focus our attention on the components below the NDIS layer on the server and the physical hops before the server.

During the troubleshooting we collected data at the switch port where the server was plugged in, so that we could see whether the packets were getting to the port. This allowed us to focus our troubleshooting further: if we can see the packets at the port but not at NDIS, then we know something between the physical port and NDIS is causing the issue. Conversely, if we do not see the packets at the switch port, then something off-box is preventing the packets from getting to the server.

Actions performed
-----------------------------------------------------------
1.       Disable Large Send Offload (TCP Chimney) on the NICs in the problem servers (see the example commands after this list).
a.       This step is done in the teaming software or in the properties of the physical NIC driver.
2.       Update the network card drivers to the latest and greatest available for the server.
3.       Remove all 3rd-party filter drivers from the server for testing.
a.       This step can be done for a period of time, making sure that the AV software is reinstalled afterwards as required by the business.
4.       Disassemble the NIC team on the server and use a single-NIC configuration for testing.
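As a sketch, on Windows Server 2008 R2 the global offload state can be checked and TCP Chimney disabled from an elevated command prompt (Large Send Offload itself is a per-NIC driver property, toggled in Device Manager or the teaming software):

netsh int tcp show global
netsh int tcp set global chimney=disabled
netsh int ip set global taskoffload=disabled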

Note: TCP Chimney (sometimes referred to as TCP Offloading)

This feature is designed to take network processing tasks, such as packet segmentation and reassembly, from the computer's CPU and move them to a network adapter that supports TCP Chimney Offload. This reduces the workload on the host CPU by moving it to the NIC, allowing the host OS to perform quicker and also speeding up the processing of network traffic.
http://blogs.technet.com/b/onthewire/archive/2014/01/21/tcp-offloading-chimney-amp-rss-what-is-it-and-should-i-disable-it.aspx

The issue was resolved by disabling TCP Chimney.


Reference article on TCP Chimney (Large Send Offload):- http://blogs.technet.com/b/onthewire/archive/2014/01/21/tcp-offloading-chimney-amp-rss-what-is-it-and-should-i-disable-it.aspx