Node Churning Events
In the THORChain ecosystem, a "churn" happens when one node goes from "Active" -> "Standby" or "Disabled" and another goes from "Ready" -> "Active". Show the cadence of node churning events over time.
THORChain Node Churning Events
As nodes join or leave the network, a “churning event” is triggered. The list of validators that can commit blocks to the chain changes, and a new Asgard vault is created while an old one is retired. All funds in the retiring vault are moved to the new Asgard vault.
Normally, a churning event happens about every 3 days (nominally 50,000 blocks), although it can happen more frequently (for example, when a node requests to leave the network). The spacing between churns is called the Churn Interval, and it is a network parameter that can be adjusted. It currently stands at 43200 blocks.
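As a rough sanity check, block counts can be converted to wall-clock time. The sketch below assumes an average block time of 6 seconds; that figure is an assumption made for illustration, not a value taken from this analysis, and the real block time varies.

```python
# Assumed average block time in seconds (illustrative only; not from the analysis).
ASSUMED_BLOCK_TIME_S = 6.0
SECONDS_PER_DAY = 86_400


def blocks_to_days(blocks: int, block_time_s: float = ASSUMED_BLOCK_TIME_S) -> float:
    """Convert a number of blocks to days for a given average block time."""
    return blocks * block_time_s / SECONDS_PER_DAY


# Compare the two block counts mentioned in the text.
for blocks in (43_200, 50_000):
    print(f"{blocks} blocks ~ {blocks_to_days(blocks):.2f} days")
```

Under that assumed block time, 43200 blocks correspond to 3 days, which is in line with the "every 3 days" cadence described above.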
Churning out
On every churn, the network selects one or more nodes to be churned out (they can typically be churned back in later). Multiple nodes may be selected in a single churning event, but never more than 1/3 of the current validator set. The network takes the following criteria into account:
- Requests to leave the network (self-removal)
- Banned by other nodes (network-removal)
- How long an active node has been committing blocks (oldest gets removed)
- Bad behavior (accrued slash points for poor node operation)
Churning in
On every churn, the network may select one or more nodes to be churned in, but it never adds more than one to the total. Nodes are selected purely by validator bond size, which is the RUNE amount each node sends to the network as leverage to ensure it behaves in the network's best interest. Nodes with larger bonds are selected over nodes with smaller bonds.
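The bond-based selection can be illustrated with a minimal sketch. It reads "never adds more than one to the total" as "at most one more node churned in than churned out", which is an interpretation and not a confirmed protocol rule. The `Candidate` type, the node names, and the bond values are hypothetical.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    node_address: str  # hypothetical identifier
    bond_rune: float   # bonded RUNE amount (made-up values below)


def select_churn_in(candidates: List[Candidate], churned_out_count: int) -> List[Candidate]:
    """Pick candidates by descending bond, capping net growth at one node.

    Assumption: the validator set may grow by at most one node per churn,
    so at most churned_out_count + 1 nodes are churned in.
    """
    max_in = churned_out_count + 1
    ranked = sorted(candidates, key=lambda c: c.bond_rune, reverse=True)
    return ranked[:max_in]


standby = [
    Candidate("node-a", 400_000.0),
    Candidate("node-b", 950_000.0),
    Candidate("node-c", 620_000.0),
    Candidate("node-d", 510_000.0),
]

# With 2 nodes churned out, up to 3 of the largest-bond candidates are selected.
selected = select_churn_in(standby, churned_out_count=2)
print([c.node_address for c in selected])
```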
Leaving the network
Churned out nodes will be put in standby, but their bond will not automatically be returned. They will be credited any earned rewards in their last session. If they do nothing but keep their cluster online, they will be eventually churned back in.
Alternatively, an "Active" node can leave the system voluntarily, in which case they are marked to churn out first. Leaving is considered permanent, and the node-address is permanently jailed. This prevents abuse of the LEAVE system since leaving at short notice is disruptive and expensive.
It is assumed nodes that wish to LEAVE will be away for a significant period of time, so by permanently jailing their address it forces them to completely destroy and re-build before re-entering. This also ensures they are running the latest software.
Methodology
To identify each churning event, the following methodology is used:
- First, former_status and current_status are concatenated with the 'to' string
- Second, a case function names the churning events as
  'Active to Standby' as 'Churn out'
  'Active to Disabled' as 'Churn out'
  'Ready to Active' as 'Churn in'
- Lastly, a count of node_addresses aggregated by block_id gives the number of churns
- Additionally, the churn interval is calculated using the lag function
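The steps above can be sketched in plain Python. This is an illustration of the logic only, not the query used for the dashboard; the sample rows, node names, and block numbers are made up, and the column order of the tuples is an assumption.

```python
from collections import Counter

# Hypothetical sample rows: (block_id, node_address, former_status, current_status)
events = [
    (1000, "node-a", "Active", "Standby"),
    (1000, "node-b", "Active", "Disabled"),
    (1000, "node-c", "Ready", "Active"),
    (44200, "node-d", "Active", "Standby"),
    (44200, "node-e", "Ready", "Active"),
    (44200, "node-f", "Standby", "Ready"),  # not a churn transition; ignored
]

# Mapping that plays the role of the case function.
CHURN_LABELS = {
    "Active to Standby": "Churn out",
    "Active to Disabled": "Churn out",
    "Ready to Active": "Churn in",
}


def count_churns(rows):
    """Count churn-in and churn-out rows per block_id."""
    counts = Counter()
    for block_id, _node_address, former, current in rows:
        label = CHURN_LABELS.get(f"{former} to {current}")
        if label is not None:
            counts[(block_id, label)] += 1
    return counts


def churn_intervals(counts):
    """Blocks elapsed since the previous churn block (a lag-style difference)."""
    blocks = sorted({block_id for block_id, _label in counts})
    return {cur: cur - prev for prev, cur in zip(blocks, blocks[1:])}


counts = count_churns(events)
intervals = churn_intervals(counts)
print(dict(counts))
print(intervals)
```

The first churn block has no predecessor, so it gets no interval, which mirrors a lag function returning no value for the first row.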
Figure 1 shows the number of churning events in the selected timeframe since January 1st 2022. Usually 2 more nodes are churned out than churned in, with 7, 6 or 5 being the most common numbers of churn outs per churn interval. There are some exceptions where only 4 or 3 churn outs occur, and even times where more churn ins than churn outs happen. There is also a special event where up to 11 nodes were churned out and 9 nodes were churned in. This occurred on March 19th, and the next event happened 6 days afterwards.
Figure 2 shows the same data as Figure 1, normalised. Usually the ratio of churn outs to churn ins lies around 60-40, with a couple of 50-50 days and 2 days where more churn ins happen, with a ratio of 56-44.
Figure 3 shows the churn interval. It is currently defined as 43200 blocks according to the Mimir documentation, and the data show that this value holds most of the time, although observed intervals can go up to 50000 blocks.
On March 25th, an outlier of 96487 blocks appears. At first glance, this could be attributed to missing data in the tables, but comparing it with Figure 1 suggests that this event was somehow "planned", as 11 nodes were churned out and 7 nodes were churned in.
All data is queried from the thorchain.update_node_account_status_events table.
Conclusion
The churning mechanism is a way to ensure the network is run by clean and updated nodes with no record of ill behaviour. It also allows nodes that decide to cease operation to do so in a scheduled way, with no impact on network runtime.
We have seen that the measured churn interval corresponds quite well with the intended parameter. We have also seen that occasionally a larger number of nodes takes part in a churning event, and that this coincides with a churn interval of roughly double the usual length. Could this be planned maintenance of some kind?
Analysis done by @KaskCEA powered by Flipside!