Working with Get-CacheClusterHealth and judging the results in scripts/checks

By | 2016-08-12

The AppFabric Distributed Cache Service is an important yet fragile part of SharePoint 2013. Retrieving its latest status, for example for use in your monitoring, can give you a trigger when something is wrong with it.
The cmdlet “Get-CacheClusterHealth” returns a lot of textual information about the current state of all caches and cache hosts. That can be useful, but it is difficult to use and to interpret. The result is of type “Microsoft.ApplicationServer.Caching.Commands.ClusterHealth”.
Its formatting is text, not the objects we are used to in PowerShell. Furthermore, the structure does not lend itself to easy conversion into objects: the hostname is mentioned once, and then all NamedCaches for that host are listed without referencing the hostname again.

The output is explained in this Microsoft article: technet.microsoft.com/en-us/library/ff921010.aspx

The health categories are described in this table from the referenced TechNet article:

Healthy: The cache is operating normally. This is the target state for all caches.
UnderReconfiguration: The cache is under reconfiguration. This is an internal state that may have several causes, but it should be temporary and resolve to healthy.
NotPrimary: The cache is not currently available. This can happen when secondary copies are promoted to primary. During this transition, the cache may temporarily have a state of NotPrimary. This state should typically resolve to healthy.
NoWriteQuorum: The cache is read-only, because the cache is unable to create the required number of replicas on secondary cache hosts. This occurs when the cache has the high availability option enabled (Secondaries = 1). In this scenario, there must be at least two running cache hosts in the cluster, one for the primary copy of the cached item and another for the secondary copy.
Throttled: The cache is read-only, because the cache host is in a throttled memory state. This is a low-memory condition.
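If a script has to act on these categories, one option is to map each of them to a severity. The sketch below is one possible mapping, under an explicit assumption: the states the table calls temporary (UnderReconfiguration, NotPrimary) count as warnings, and the read-only states (NoWriteQuorum, Throttled) as critical. The function name Get-CacheHealthSeverity is hypothetical and not part of AppFabric.

```powershell
# Hypothetical helper: maps a health category from the TechNet table to a
# severity. The mapping itself is an assumption, not defined by AppFabric.
function Get-CacheHealthSeverity {
    param(
        [Parameter(Mandatory)]
        [string]$Category
    )

    switch ($Category) {
        'Healthy'              { 'OK' }        # target state for all caches
        'UnderReconfiguration' { 'Warning' }   # should be temporary
        'NotPrimary'           { 'Warning' }   # should typically resolve to healthy
        'NoWriteQuorum'        { 'Critical' }  # cache is read-only
        'Throttled'            { 'Critical' }  # read-only, low-memory condition
        default                { 'Unknown' }   # category not listed in the table
    }
}

Get-CacheHealthSeverity -Category 'NotPrimary'   # returns 'Warning'
```

Categories that are not in the table fall through to 'Unknown', so an unexpected name in the output is not silently treated as healthy.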

The actual output looks like this (fragment):

Writing code to convert the result

I couldn’t find a working solution for handling this data, so I created my own. It is based on regular expressions. My inspiration came from this example: msgoodies.blogspot.nl/2008/12/matching-multi-line-text-and-converting.
It took me some time to get the regex working, but I finally found a way to use the Get-CacheClusterHealth output in an automated script that regularly checks the health and triggers an alert if anything is wrong.

In each line of output, the property name precedes its value, as in “HostName = SP2013FE2.local.nl” and “Healthy = 5,00”. In the end result, we want every property associated with its NamedCache per host; that last part needed some more fine-tuning. The target looks like this:

HostName   NamedCache                                  Healthy  UnderReconfiguration  NotPrimary  InadequateSecondaries  Throttled
SP2013FE   DistributedSecurityTrimmingCache_<<GUID>>   5        0                     0           0                      0
SP2013FE   DistributedSearchCache_<<GUID>>             5        0                     0           0                      0
SP2013FE   DistributedViewStateCache_<<GUID>>          5        0                     0           0                      0
…          …
SP2013FE2  DistributedSecurityTrimmingCache_<<GUID>>   5        0                     0           0                      0
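Parsing a single “Name = Value” line of the output can be sketched as follows. ConvertFrom-HealthLine is a hypothetical helper name. It assumes that a comma in a value such as “5,00” is always a decimal separator (no thousands separators), so it swaps the comma for a dot and parses with the invariant culture; values that are not numbers, such as host names, are returned as strings.

```powershell
# Hypothetical helper: splits one "Name = Value" line of the
# Get-CacheClusterHealth text output into a name and a value.
function ConvertFrom-HealthLine {
    param(
        [Parameter(Mandatory)]
        [string]$Line
    )

    # \s* allows padding around '=', \S+ captures the value up to the next whitespace
    if ($Line -match '^\s*(?<Name>\w+)\s*=\s*(?<Value>\S+)\s*$') {
        $name  = $Matches['Name']
        $value = $Matches['Value']

        # Assumption: ',' is only ever a decimal separator, e.g. "5,00" -> 5
        $normalized = $value -replace ',', '.'
        $number = 0.0
        if ([double]::TryParse($normalized, [Globalization.NumberStyles]::Float, [Globalization.CultureInfo]::InvariantCulture, [ref]$number)) {
            $value = $number
        }

        [pscustomobject]@{ Name = $name; Value = $value }
    }
}

# Returns an object with Name 'Healthy' and Value 5
ConvertFrom-HealthLine -Line '        Healthy              = 5,00'
```

Lines that do not have the “Name = Value” shape (such as the dashed separator line) produce no output at all.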

The script: regex

The first part is the regex. It contains two blocks enclosed in round brackets; the difference between them is that the first one includes the hostname and the second one does not. That is because the first NamedCache we encounter for a host is preceded by the hostname, while the other NamedCaches for that same host are listed without referencing the hostname again.

In this regex, we look up each property by its name and the equals sign; \s+ stands for “one or more whitespace characters”. The actual value is extracted with \S+, which means “one or more non-whitespace characters”. Each value is captured in a named group, with the property name enclosed in less-than and greater-than signs, for example (?<Healthy>\S+). These names are used later when creating the PowerShell objects.
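A minimal sketch of such a two-block regex follows. It runs against a hypothetical sample text modelled on the “Name = Value” lines described in this post, not on captured output. The category names are taken from the TechNet table, so adjust them if your output uses different ones (the result table above, for example, has an InadequateSecondaries column). Both blocks reuse the same named groups, which .NET regular expressions allow; only the first block captures HostName. It uses \s* around the equals sign to tolerate different padding.

```powershell
# Hypothetical sample, modelled on the "Name = Value" lines described in this
# post; real output may differ. On a cache host you might capture the text with
# (untested assumption): $text = Get-CacheClusterHealth | Out-String
$text = @'
HostName = SP2013FE.local.nl
----------------------------
    NamedCache = DistributedSearchCache_GUID
        Healthy              = 5,00
        UnderReconfiguration = 0,00
        NotPrimary           = 0,00
        NoWriteQuorum        = 0,00
        Throttled            = 0,00
    NamedCache = DistributedViewStateCache_GUID
        Healthy              = 5,00
        UnderReconfiguration = 0,00
        NotPrimary           = 0,00
        NoWriteQuorum        = 0,00
        Throttled            = 0,00
'@

# Shared part: property name, '=', then the value (\S+) in a named group <...>
$cachePart = 'NamedCache\s*=\s*(?<NamedCache>\S+)\s+' +
    'Healthy\s*=\s*(?<Healthy>\S+)\s+' +
    'UnderReconfiguration\s*=\s*(?<UnderReconfiguration>\S+)\s+' +
    'NotPrimary\s*=\s*(?<NotPrimary>\S+)\s+' +
    'NoWriteQuorum\s*=\s*(?<NoWriteQuorum>\S+)\s+' +
    'Throttled\s*=\s*(?<Throttled>\S+)'

# Block 1: first cache of a host, preceded by the hostname and the dashed line.
# Block 2: the following caches of that host, listed without a hostname.
$pattern = "(HostName\s*=\s*(?<HostName>\S+)\s+-+\s+$cachePart)|($cachePart)"

$found = [regex]::Matches($text, $pattern)
foreach ($m in $found) {
    $hostName = $m.Groups['HostName'].Value   # empty string for block 2 matches
    '{0,-20} {1,-32} Healthy={2}' -f $hostName, $m.Groups['NamedCache'].Value, $m.Groups['Healthy'].Value
}
```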

The script: matches to objects

The next part handles the regex matches. It is important to group by hostname: we only get the hostname once for all caches related to that host, so we save it in the $hostname variable.
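The conversion step could look roughly like the sketch below, again using a hypothetical two-host sample instead of live output. For brevity it builds the same two-block regex from a list of category names, and it assumes every value uses a comma as decimal separator. The key point is $hostname: it is only updated when a match contains a hostname, so the caches that follow inherit it.

```powershell
# Hypothetical two-host sample modelled on the format described in this post;
# real output may differ. Live (untested assumption): $text = Get-CacheClusterHealth | Out-String
$text = @'
HostName = SP2013FE.local.nl
----------------------------
    NamedCache = DistributedSearchCache_GUID
        Healthy              = 5,00
        UnderReconfiguration = 0,00
        NotPrimary           = 0,00
        NoWriteQuorum        = 0,00
        Throttled            = 0,00
    NamedCache = DistributedViewStateCache_GUID
        Healthy              = 5,00
        UnderReconfiguration = 0,00
        NotPrimary           = 0,00
        NoWriteQuorum        = 0,00
        Throttled            = 0,00
HostName = SP2013FE2.local.nl
-----------------------------
    NamedCache = DistributedSearchCache_GUID
        Healthy              = 5,00
        UnderReconfiguration = 0,00
        NotPrimary           = 0,00
        NoWriteQuorum        = 0,00
        Throttled            = 0,00
'@

$categories = 'Healthy', 'UnderReconfiguration', 'NotPrimary', 'NoWriteQuorum', 'Throttled'

# Same two blocks as before: with hostname, and without hostname
$cachePart = 'NamedCache\s*=\s*(?<NamedCache>\S+)' +
    (($categories | ForEach-Object { "\s+$_\s*=\s*(?<$_>\S+)" }) -join '')
$pattern = "(HostName\s*=\s*(?<HostName>\S+)\s+-+\s+$cachePart)|($cachePart)"

$hostname = $null
$result = foreach ($m in [regex]::Matches($text, $pattern)) {
    # The hostname only appears with the first cache of each host: remember it
    if ($m.Groups['HostName'].Success) { $hostname = $m.Groups['HostName'].Value }

    $props = [ordered]@{ HostName = $hostname; NamedCache = $m.Groups['NamedCache'].Value }
    foreach ($c in $categories) {
        # "5,00" -> 5 (assumes ',' is only used as decimal separator)
        $props[$c] = [double]::Parse(($m.Groups[$c].Value -replace ',', '.'), [Globalization.CultureInfo]::InvariantCulture)
    }
    [pscustomobject]$props
}

$result | Format-Table -AutoSize
```

Because foreach is a statement and not a script block, $hostname keeps its value from one iteration to the next.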

Examples to determine the health based on the script

Update: CAUTION: This next part is fundamentally wrong!

The last part contains some examples of using the data:
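Because the original examples in this part are marked as fundamentally wrong, the following is not a reconstruction of them but a separate, deliberately conservative sketch. It assumes that any non-zero value in a category other than Healthy deserves attention, and it does not try to interpret the Healthy value itself. The input objects are hypothetical and only mimic the shape of the converted data.

```powershell
# Hypothetical objects in the shape of the converted data; in a real script
# they would come from the parsed Get-CacheClusterHealth output.
$result = @(
    [pscustomobject]@{ HostName = 'SP2013FE';  NamedCache = 'DistributedSearchCache_GUID';    Healthy = 5; UnderReconfiguration = 0; NotPrimary = 0; NoWriteQuorum = 0; Throttled = 0 }
    [pscustomobject]@{ HostName = 'SP2013FE2'; NamedCache = 'DistributedViewStateCache_GUID'; Healthy = 3; UnderReconfiguration = 0; NotPrimary = 0; NoWriteQuorum = 0; Throttled = 2 }
)

# Every category except Healthy; adjust to the names in your own output
$problemCategories = 'UnderReconfiguration', 'NotPrimary', 'NoWriteQuorum', 'Throttled'

# One entry per host/cache/category combination with a non-zero value
$problems = foreach ($r in $result) {
    foreach ($c in $problemCategories) {
        if ($r.$c -gt 0) {
            [pscustomobject]@{ HostName = $r.HostName; NamedCache = $r.NamedCache; Category = $c; Value = $r.$c }
        }
    }
}

if ($problems) {
    # Hook your alerting in here, e.g. write an event or return a non-zero status
    Write-Warning ('{0} cache health issue(s) found' -f @($problems).Count)
    $problems | Format-Table -AutoSize
} else {
    'All named caches report only Healthy values'
}
```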

The end result might look like this:
