Coherence is a reliable in-memory data grid product that offers out-of-the-box (OOTB) failover and continuous availability with extreme scalability. At times, however, we face challenges during Coherence deployment and lean towards a clean restart of the entire Coherence Cluster. This defeats the purpose of a data grid layer that is available 24x7, and eventually hurts the availability of dependent applications as well.
I have come across this discussion with several people, so I am sharing my thoughts on a Coherence deployment strategy that requires no downtime and ensures continuous availability.
In my opinion, there are three high-level scenarios with respect to Coherence deployment:
Scenario 1 - Deployment of an Application That Uses the Coherence Data Grid Layer
- Problem Statement: Typically, this is the case when multiple web or native applications are backed by the Coherence data grid layer. Often, the infrastructure team restarts the Coherence Cluster during deployment, causing downtime for the cache layer and eventually for the entire application. The downtime can be extended (even hours), as a clean restart of Coherence usually takes time.
- Solution Approach:
- As a best practice, Coherence Cluster shutdown and restart should be avoided wherever possible. Coherence does not need a clean restart unless the libraries deployed on the Coherence nodes change (which is the second scenario below).
- If there is a requirement to clean up existing cache entries and replace them with new ones, that is a matter of how the application maintains versions of its cache items rather than a change to the cache system. Typically, each cache item can carry version information (for example, a getter method like getVersion()), and after deployment the application can discard entries from the previous version.
- You can also refer to the cache invalidation strategies that Coherence provides as OOTB features.
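The version-based clean-up described above can be sketched roughly as follows. This is only an illustration: the class VersionedCacheCleanup, the value type VersionedItem, and the method discardOlderThan are hypothetical names, and a ConcurrentHashMap stands in for the cache so that the sketch runs without a cluster. A Coherence NamedCache is also a java.util.Map, so the same idea carries over, although on a real grid a server-side removal would likely be preferable to pulling every value to the client.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical demo class; none of these names are part of Coherence.
final class VersionedCacheCleanup {

    // Hypothetical cache value that records which application version wrote it.
    static final class VersionedItem {
        private final String payload;
        private final int version;

        VersionedItem(String payload, int version) {
            this.payload = payload;
            this.version = version;
        }

        String getPayload() { return payload; }
        int getVersion() { return version; }
    }

    // Removes every entry written by an older application version and
    // returns how many entries were discarded.
    static int discardOlderThan(Map<String, VersionedItem> cache, int currentVersion) {
        int before = cache.size();
        cache.values().removeIf(item -> item.getVersion() < currentVersion);
        return before - cache.size();
    }

    public static void main(String[] args) {
        // A plain map stands in for the cache (assumption for this sketch).
        Map<String, VersionedItem> cache = new ConcurrentHashMap<>();
        cache.put("order:1", new VersionedItem("old format", 1));
        cache.put("order:2", new VersionedItem("old format", 1));
        cache.put("order:3", new VersionedItem("new format", 2));

        // Run once after the new application version has been deployed.
        int discarded = discardOlderThan(cache, 2);
        System.out.println("Discarded " + discarded + " stale entries, " + cache.size() + " left");
    }
}
```

The point of the design is that the cache itself never has to be restarted: the newly deployed application simply ignores or removes anything stamped with an older version.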
Scenario 2 - Deployment of an Application That Uses the Coherence Data Grid Layer, with Updated Coherence Application Libraries
- Problem Statement: This scenario applies when a Coherence application cache is used, particularly where read-through or write-through patterns are implemented. In this case, application-specific JAR files or libraries need to be updated on the Coherence nodes, and hence the infrastructure team tends to shut down the entire Coherence Cluster and do a clean restart.
- Solution Approach:
- As a best practice, Coherence Cluster shutdown &
restart should be avoided wherever possible.
- A cyclic restart (or rolling restart), in which nodes are restarted one at a time, can help in this case, along with version-based maintenance and cache invalidation strategies for cache items (as explained in scenario 1).
- Note that invalidation or cache item clean-up plays a critical role here: even when Coherence nodes are restarted, their data is automatically preserved in the data grid layer through backups held by other nodes. In essence, the failover feature works against a clean deployment in this case, so the clean-up approach needs care.
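A rolling restart along these lines could be orchestrated roughly as in the sketch below. Everything in it is an assumption made for illustration: the Node interface, the restartAll method, and the clusterIsSafe check are hypothetical, and a real health check might, for instance, read the StatusHA attribute that Coherence exposes on its service MBean and wait until it no longer reports ENDANGERED.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;

// Hypothetical orchestration sketch; not a Coherence API.
final class RollingRestart {

    // Hypothetical handle on one cache server; a real one would wrap a script or management call.
    interface Node {
        String name();
        void stop();
        void start();
    }

    // Restarts nodes one at a time. Before touching each node it polls
    // clusterIsSafe (up to maxPolls times) so that backups can be rebuilt first.
    // Returns the names of the restarted nodes, in order.
    static List<String> restartAll(List<Node> nodes, BooleanSupplier clusterIsSafe,
                                   int maxPolls, long pollMillis) {
        List<String> restarted = new ArrayList<>();
        for (Node node : nodes) {
            waitUntilSafe(clusterIsSafe, maxPolls, pollMillis);
            node.stop();
            node.start();
            restarted.add(node.name());
        }
        // Final check so the caller knows the grid has settled after the last node.
        waitUntilSafe(clusterIsSafe, maxPolls, pollMillis);
        return restarted;
    }

    private static void waitUntilSafe(BooleanSupplier clusterIsSafe, int maxPolls, long pollMillis) {
        for (int i = 0; i < maxPolls; i++) {
            if (clusterIsSafe.getAsBoolean()) {
                return;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting for the cluster", e);
            }
        }
        throw new IllegalStateException("Cluster did not become safe; aborting rolling restart");
    }

    public static void main(String[] args) {
        List<Node> nodes = new ArrayList<>();
        for (String n : new String[] {"node-1", "node-2", "node-3"}) {
            nodes.add(new Node() {
                public String name() { return n; }
                public void stop() { System.out.println("stopping " + n); }
                public void start() { System.out.println("starting " + n + " with new JARs"); }
            });
        }
        // Stub health check that always reports safe; replace with a real check.
        System.out.println(restartAll(nodes, () -> true, 10, 100L));
    }
}
```

Aborting when the grid never becomes safe is deliberate: stopping a second node while backups are still being rebuilt is exactly how data gets lost during a rolling restart.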
Scenario 3 - Coherence Configuration Change as Part of Deployment
- Problem Statement: This scenario applies when there are changes in the Coherence configuration (cluster configuration or otherwise), for example a change in security configuration (using an override file), a TTL change, or a Coherence edition change. Note that if the configuration of any Coherence node differs (even in a minor way) from the rest of the cluster, that node will be rejected by the Coherence Cluster.
- Solution Approach:
- The easiest approach is to shut down the entire Coherence Cluster (JMX monitoring can help to make sure all Coherence nodes are down) and, after the configuration change, restart all nodes. But this defeats our purpose of ZERO DOWNTIME.
- If zero downtime is needed, then we need to:
  - Set up an entirely new Coherence Cluster (e.g. by assigning a new multicast IP address or changing the multicast port)
  - Make the configuration changes and do a fresh deployment on the new cluster
  - Do a cyclic restart of the dependent application servers so that they use the new Coherence Cluster setup
  - Discard the old Coherence Cluster once the applications have been migrated to the new Coherence Cluster
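As a rough illustration of the first step, the sketch below builds the command line for a cache server that joins a separate cluster. The class NewClusterLauncher, the method cacheServerCommand, the cluster name, the multicast address, the port, and the JAR names are all made up for the example. The system properties use the older "tangosol.coherence." prefix, which is an assumption about the Coherence version in use; newer releases also accept names without the "tangosol." prefix.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; only the system property names and the main class come from Coherence.
final class NewClusterLauncher {

    // Builds the command line for one cache server JVM that joins a separately
    // named cluster on its own multicast address and port.
    static List<String> cacheServerCommand(String clusterName, String multicastAddress,
                                           int multicastPort, String classpath) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-cp");
        cmd.add(classpath);
        // Pre-12.2.1 style property names (assumption about the version in use).
        cmd.add("-Dtangosol.coherence.cluster=" + clusterName);
        cmd.add("-Dtangosol.coherence.clusteraddress=" + multicastAddress);
        cmd.add("-Dtangosol.coherence.clusterport=" + multicastPort);
        cmd.add("com.tangosol.net.DefaultCacheServer");
        return cmd;
    }

    public static void main(String[] args) {
        // Illustrative values only; pick an address and port not used by the old cluster.
        String classpath = "coherence.jar" + File.pathSeparator + "app-v2.jar";
        List<String> cmd = cacheServerCommand("grid-v2", "224.3.6.1", 36001, classpath);
        System.out.println(String.join(" ", cmd));
        // To actually start the node: new ProcessBuilder(cmd).inheritIO().start();
    }
}
```

Because the new nodes use a different cluster name, address, and port, they never try to join the old cluster, so both grids can run side by side while the application servers are moved over one at a time.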
There can be multiple other deployment scenarios, but they are likely to be variations of the scenarios described above (at least in my mind).
I hope this helps all those who are seeking zero downtime deployment without paying extra for other products like Oracle GoldenGate to achieve the same.