Wednesday, January 15, 2014

Zero Downtime of Coherence Infrastructure (24x7 Availability) as part of Planned Deployment Strategy


Coherence is a reliable in-memory data grid product that offers out-of-the-box (OOTB) failover and continuous availability with extreme scalability. At times, however, we face challenges during Coherence deployments and lean towards a clean restart of the entire Coherence cluster. This defeats the purpose of 24x7 availability of the data grid layer and, eventually, of the dependent applications as well.
I have had this discussion with several people, so I am sharing my thoughts on a Coherence deployment strategy that requires no downtime and ensures continuous availability.

In my opinion, there are three high-level scenarios with respect to Coherence deployment:

Scenario 1 - Deployment of an Application That Uses the Coherence Data Grid Layer
  • Problem Statement: Typically, this is the case when multiple web or native applications are backed by the Coherence data grid layer. Often, the infrastructure team restarts the Coherence cluster as part of the deployment, which takes down the cache layer and eventually the entire application. Because a clean restart of Coherence usually takes time, the application downtime can extend to hours.
  • Solution Approach: 
    • As a best practice, Coherence cluster shutdown and restart should be avoided wherever possible. Coherence does not need a clean restart unless libraries change (which is the second scenario below).
    • If existing cache entries must be cleaned up and replaced with new ones, that is a matter of how the application versions its cache items rather than a change to the cache system. Typically, each cache item can carry version information (exposed through a getter such as getVersion()), and after deployment the application can discard entries from the previous version.
    • You can also refer to the cache invalidation strategies that Coherence offers as an OOTB feature.
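The version-based clean-up described above can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the cache is modelled as a plain java.util.Map (a Coherence NamedCache can be used through the Map interface), and the Versioned interface, the CachedQuote class, and the discardStaleEntries helper are hypothetical names invented for this example, not part of Coherence.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical contract: each cached value records the release that wrote it.
interface Versioned {
    int getVersion();
}

// Hypothetical cached value, used only for illustration.
final class CachedQuote implements Versioned {
    private final String symbol;
    private final int version;

    CachedQuote(String symbol, int version) {
        this.symbol = symbol;
        this.version = version;
    }

    String getSymbol() { return symbol; }

    @Override
    public int getVersion() { return version; }
}

final class VersionedCacheCleanup {
    private VersionedCacheCleanup() {}

    // Removes entries written by releases older than currentVersion and
    // returns how many were removed. The cache itself is never restarted.
    static <K, V extends Versioned> int discardStaleEntries(Map<K, V> cache, int currentVersion) {
        // Collect keys first so the map is not modified while it is iterated.
        List<K> staleKeys = new ArrayList<>();
        for (Map.Entry<K, V> entry : cache.entrySet()) {
            if (entry.getValue().getVersion() < currentVersion) {
                staleKeys.add(entry.getKey());
            }
        }
        for (K key : staleKeys) {
            cache.remove(key);
        }
        return staleKeys.size();
    }
}
```

The application would call something like discardStaleEntries(cache, 2) once after version 2 goes live. On a large grid, pulling every entry to the client this way may be slow, so a server-side removal would probably scale better; the sketch only shows the idea.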
Scenario 2 - Deployment of an Application with Updated Coherence Application Libraries
  • Problem Statement: This scenario applies when application code runs inside the Coherence cache, particularly where read-through or write-through patterns are implemented. In this case, application-specific JAR files or libraries need to be updated on the Coherence nodes, and hence the infrastructure team tends to shut down the entire Coherence cluster and restart it cleanly.
  • Solution Approach: 
    • As a best practice, Coherence Cluster shutdown & restart should be avoided wherever possible.
    • A cyclic restart (or rolling restart), in which nodes are restarted one at a time, can help in this case, along with the version-based maintenance and cache invalidation strategies for cache items (as explained in scenario 1).
    • Note that invalidation or cache item clean-up plays a critical role here: even when Coherence nodes are restarted, their data survives because other nodes hold backup copies in the data grid layer. In essence, the failover feature works against a clean deployment in this case, so the clean-up approach needs care.
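A rolling restart can be sketched as a simple loop: restart one node, wait until the grid has rebuilt its backups, then move on to the next node. The sketch below assumes the node names, the stop and start actions, and the health check are all supplied by the caller; RollingRestart and its parameters are hypothetical names for illustration. In a real cluster the health check could, for example, read the StatusHA attribute of the partitioned cache service over JMX and wait until it is no longer ENDANGERED, but that wiring is not shown here.

```java
import java.util.List;
import java.util.function.BooleanSupplier;
import java.util.function.Consumer;

final class RollingRestart {
    private RollingRestart() {}

    // Restarts nodes strictly one at a time. After each restart it waits for
    // backupsRestored to report true before touching the next node, so the
    // grid never loses more than one member at once.
    static void restartOneByOne(List<String> nodeNames,
                                Consumer<String> stopNode,
                                Consumer<String> startNode,
                                BooleanSupplier backupsRestored,
                                long pollMillis,
                                int maxPolls) {
        for (String node : nodeNames) {
            stopNode.accept(node);   // remaining members keep serving the data
            startNode.accept(node);  // node rejoins with the updated libraries
            waitUntil(backupsRestored, pollMillis, maxPolls, node);
        }
    }

    // Checks the condition at most maxPolls times, sleeping between checks.
    private static void waitUntil(BooleanSupplier condition, long pollMillis,
                                  int maxPolls, String node) {
        for (int poll = 0; poll < maxPolls; poll++) {
            if (condition.getAsBoolean()) {
                return;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting after restarting " + node, e);
            }
        }
        throw new IllegalStateException("Backups not restored after restarting " + node);
    }
}
```

Failing fast when the grid does not recover is a deliberate choice in this sketch: continuing to the next node while backups are still missing is exactly how a rolling restart ends up losing data.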
Scenario 3 - Coherence Configuration Change as part of Deployment
  • Problem Statement: This scenario applies when the Coherence configuration itself changes (cluster configuration or otherwise). Note that if the configuration of any Coherence node differs, even in a minor way, the Coherence cluster will reject that node. Examples include a change in security configuration (using an override file), a TTL change, or a Coherence edition change.
  • Solution Approach: 
    • The easiest approach is to shut down the entire Coherence cluster (JMX monitoring can help confirm that all Coherence nodes are down), apply the configuration change, and restart all nodes. But this defeats our purpose of ZERO DOWNTIME.
    • If zero downtime is needed, then we need to:
      • Set up an entirely new Coherence cluster (e.g. by assigning a new multicast IP address or changing the multicast port)
      • Make the configuration changes and do a fresh deployment on the new cluster
      • Do a cyclic restart of the dependent application servers so that they use the new Coherence cluster
      • Discard the old Coherence cluster once the applications have migrated to the new one
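The first step above, giving the new cluster its own multicast address or port, can be sketched as a small helper that produces the JVM flags for the new nodes. Treat this as an assumption-laden illustration: the property names tangosol.coherence.cluster, tangosol.coherence.clusteraddress and tangosol.coherence.clusterport follow the naming used by Coherence 3.x releases and should be verified against the documentation for your version, and the ClusterIdentity class, the cluster names, and the addresses are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical value object describing how a cluster is addressed.
final class ClusterIdentity {
    private final String clusterName;
    private final String multicastAddress;
    private final int multicastPort;

    ClusterIdentity(String clusterName, String multicastAddress, int multicastPort) {
        this.clusterName = clusterName;
        this.multicastAddress = multicastAddress;
        this.multicastPort = multicastPort;
    }

    // True when the two clusters do not share a multicast address and port
    // pair, i.e. they differ in the address or in the port.
    boolean isIsolatedFrom(ClusterIdentity other) {
        return !multicastAddress.equals(other.multicastAddress)
                || multicastPort != other.multicastPort;
    }

    // JVM flags for starting a node of this cluster. The property names are
    // an assumption based on Coherence 3.x; check them for your release.
    List<String> toJvmFlags() {
        List<String> flags = new ArrayList<>();
        flags.add("-Dtangosol.coherence.cluster=" + clusterName);
        flags.add("-Dtangosol.coherence.clusteraddress=" + multicastAddress);
        flags.add("-Dtangosol.coherence.clusterport=" + multicastPort);
        return flags;
    }
}
```

For example, an old cluster at 224.3.6.0 port 9000 and a new one at 224.3.6.0 port 9001 are isolated from each other, so both can run side by side while the application servers are moved across one at a time.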
There may be other deployment scenarios, but they are likely to be variations of the scenarios described above (at least in my mind).

I hope this helps everyone seeking zero-downtime deployment without paying extra for other products, such as Oracle GoldenGate, to achieve the same.

Disclaimer:

All data and information provided on this site is for informational purposes only. This site makes no representations as to accuracy, completeness, correctness, suitability, or validity of any information on this site and will not be liable for any errors, omissions, or delays in this information or any losses, injuries, or damages arising from its display or use. All information is provided on an as-is basis. This is a personal weblog. The opinions expressed here represent my own and not those of my employer or any other organization.