How do I recover my cluster when I see an "Invalid tlog reported" error?


FID/VDS uses the tlog mechanism to push changes from the leader to all followers in a clustered environment.

This error is reported when the tlog revision of the newly elected leader differs from that of the old leader by more than 1 revision.
In the example below, the newly elected leader is 5 revisions behind.
In $RLI_HOME\logs\vds_events.log, you will see this message:
"Invalid tlog reported. Last transactions from the leader node haven't been published to this node. Refusing to become a leader to preserve cluster consistency. To force this node as a leader, remove its hdapStates in zookeeper or use the command line option -force when starting the server. Last invalid revision = 1240080 store revision=1240075"
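To see how far behind a node is, you can read the two revision numbers at the end of the message and subtract them. The following is a minimal Python sketch, not part of the product: the function name tlog_revision_gap and the regular expression are assumptions based only on the sample message above, and the wording of the message may differ between versions.

```python
import re

# Assumed pattern, derived only from the sample message in this article.
REVISION_PATTERN = re.compile(
    r"Last invalid revision\s*=\s*(\d+)\s+store revision\s*=\s*(\d+)"
)


def tlog_revision_gap(log_line):
    """Return (invalid_revision, store_revision, gap), or None if no match.

    Hypothetical helper: 'gap' is the difference between the last invalid
    revision and the store revision reported in the log line.
    """
    match = REVISION_PATTERN.search(log_line)
    if match is None:
        return None
    invalid_revision = int(match.group(1))
    store_revision = int(match.group(2))
    return invalid_revision, store_revision, invalid_revision - store_revision


# Tail of the sample message from vds_events.log.
sample = "Last invalid revision = 1240080 store revision=1240075"
print(tlog_revision_gap(sample))  # (1240080, 1240075, 5)
```

For the sample message the gap is 5, which matches the 5 revisions mentioned in the example.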

This revision mismatch would lead to a loss of data between the cluster nodes, so the whole cluster is proactively shut down.
Your application or load balancer should then direct traffic to another working cluster.

To recover from this, find the "lastLeaderId" in the ZK (ZooKeeper) tab under /radiantone/v2/<clustername>, then start the VDS server on that node first. You can start the rest of the nodes afterwards.
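If you script your restarts, the ordering rule above can be sketched as follows. This is only an illustration in Python: the function startup_order and the node IDs are hypothetical, and the sketch does not read "lastLeaderId" from ZooKeeper or start any server. It only computes the order in which to start the nodes.

```python
def startup_order(last_leader_id, node_ids):
    """Return the node IDs with the last leader first.

    Hypothetical helper: the remaining nodes keep their original order.
    """
    if last_leader_id not in node_ids:
        raise ValueError("lastLeaderId is not one of the cluster nodes")
    others = [node for node in node_ids if node != last_leader_id]
    return [last_leader_id] + others


# Made-up node IDs for illustration; use the value of "lastLeaderId"
# that you looked up for your own cluster.
print(startup_order("node-2", ["node-1", "node-2", "node-3"]))
# ['node-2', 'node-1', 'node-3']
```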


