Ethereum Merge: Running a dominant client? Do so at your own risk

Special thanks to Vitalik Buterin, Hsiao-Wei Wang and Caspar Schwarz-Schilling for their feedback and comments.

Abstract: For the sake of safety and liveness, Ethereum chose a multi-client architecture. To encourage stakers to diversify their client setups, the Ethereum protocol imposes higher penalties for correlated failures. A staker running a client with a minority share will therefore typically lose only a modest amount if that client has a bug, whereas a staker running the dominant client may lose the entire staked deposit when that client hits a bug. Responsible stakers should therefore look at the current client distribution and choose to run a less popular client.

Why do we need multiple clients?

There are many arguments that a single-client architecture is preferable. Developing multiple clients carries a huge overhead, which is why we haven’t seen any other blockchain network seriously pursue a multi-client approach.

So why is Ethereum aiming for a multi-client architecture? Clients are very complex pieces of code and may contain bugs. The worst of these are the so-called “consensus bugs”, which are bugs in the core state transition logic of the blockchain. An oft-cited example is the “infinite money supply” bug, where a client with this bug accepts a transaction that mints an arbitrary amount of ETH. If someone finds such a bug and is not stopped before reaching a safe exit (i.e. spending the funds through a mixer or an exchange), ETH will lose a lot of value.

If everyone were running the same client, stopping this would require human intervention, because in this case the blockchain, all smart contracts and all exchanges would keep operating as usual. It might take an attacker only a few minutes to carry out the attack and disperse the funds widely enough that simply rolling back the attacker’s transactions becomes impossible. Depending on the amount of ETH minted, the Ethereum community would likely coordinate a rollback of Ethereum to the state it was in before the attack (this would have to happen after the bug is identified and fixed).

Now let’s see what happens when we have multiple clients. There are two possible cases:

  1. Clients containing the bug carry less than 50% of the total network stake. The buggy client will produce a block containing a transaction that exploits the bug to mint ETH. Let’s call this chain the A chain.

However, the majority of stakers, who run a client without the bug, will ignore this block because it is invalid (to their client, the ETH minting operation is invalid). These stakers (validators) will therefore build another chain, let’s call it the B chain, which does not contain the invalid block.

As the correct client dominates, the B chain will accumulate more attestations. Therefore even the client with the consensus bug will vote for chain B; as a result, chain B will accumulate 100% of the votes and chain A will die. The blockchain will move on as if the bug had never happened.

  2. The majority of stakers use the buggy client. In this case, the A chain will accumulate the majority of votes. Since chain B has less than 50% of the attestations, the buggy client will never have a reason to switch from chain A to chain B. Therefore, we will see a blockchain split. As shown below:

[Figure: blockchain split between chain A and chain B]

The first case above is the ideal case. It will most likely produce an orphaned block that most users will not even notice. The developers can debug the client, fix the bug, and all is well. The second case is obviously not ideal, but it is still a better outcome than with only one client (i.e. a single-client architecture): most people will quickly find out that the chain has forked (you can detect this automatically by running several clients), exchanges will quickly suspend deposits, and DeFi users can exercise caution until the chain split is resolved. Basically, compared with a single-client architecture, this still gives us a big flashing red warning light that protects us from the worst possible outcome.
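The automatic detection mentioned above could be sketched roughly as follows. This is only an illustration: the function name and the input format are hypothetical, and it assumes you have already collected, from nodes running different clients, the block root each one reports for the same slot.

```python
from collections import defaultdict

def detect_chain_split(roots_by_client):
    """Group clients by the block root they report for the same slot.

    `roots_by_client` maps a client name to the root it reports,
    e.g. {"client_x": "0xaa", "client_y": "0xbb"} (hypothetical data).
    Returns the groups plus a flag that is True when clients disagree.
    """
    groups = defaultdict(list)
    for client, root in roots_by_client.items():
        groups[root].append(client)
    return dict(groups), len(groups) > 1

# Example with made-up roots: two clients agree, one is on another chain.
groups, split = detect_chain_split(
    {"client_x": "0x01", "client_y": "0x01", "client_z": "0x02"}
)
print(split, groups)
```

A real monitor would probably want to tolerate short-lived disagreement (nodes can briefly see different heads) and only raise an alarm when the disagreement persists.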

In the second case above, if the buggy client is run by more than 2/3 of the stakers, the situation is even worse, because that client will finalize the invalid chain (i.e. chain A). We will elaborate on this below.
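The thresholds discussed so far can be summarized in a small sketch. The function name and the wording of the outcomes are my own; the 1/2 and 2/3 boundaries are the ones described above, and the behaviour exactly at a boundary is an assumption.

```python
from fractions import Fraction

def fork_outcome(buggy_share):
    """Classify the result of a consensus bug by the buggy client's stake share."""
    share = Fraction(buggy_share)
    if share < Fraction(1, 2):
        # The correct clients outvote the buggy one.
        return "chain B wins; chain A is orphaned"
    if share <= Fraction(2, 3):
        # Neither side reaches the 2/3 needed for finality.
        return "chain split; neither chain finalizes"
    # The buggy client alone has enough stake to finalize.
    return "chain split; buggy client finalizes invalid chain A"

for share in (Fraction(1, 3), Fraction(3, 5), Fraction(7, 10)):
    print(share, "->", fork_outcome(share))
```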

Some argue that chain splits are so catastrophic that they are themselves an argument in favor of a single-client architecture. Note, however, that a chain split only happens because of a bug in a client. With a single-client architecture, if you want to fix the bug and return the blockchain to its previous state, you have to roll back to the block before the bug occurred, which is just as bad as a chain split! So although a chain split sounds bad, in the case of a serious client bug it is actually a feature of the multi-client architecture, not a bug. At least you know that something has gone seriously wrong.

Incentivizing Client Diversity: Anti-Correlation Penalties

It is obviously beneficial to the network if the validators’ stake is distributed across multiple clients, and in the best case each client holds less than 1/3 of the total stake. This makes the network resilient to a bug in any single client. But why should stakers care about this? If the network gives stakers no incentive, they are unlikely to be willing to bear the cost of switching to a minority client.

Unfortunately, we cannot reward validators directly based on which client they choose to run. There is no objective way to measure this.

However, you’re not immune when the client you’re running has a bug. This is where anti-correlation penalties come into play: the idea is that if a validator you run misbehaves, the penalty is higher the more other validators make a mistake at around the same time. In other words, you get penalized for correlated failures.

In Ethereum, you (the validator) get slashed for two actions:

  1. Signing two blocks at the same block height.
  2. Creating slashable attestations (surround votes or double votes).

When you (the validator) are slashed, you usually don’t lose all your funds. At the time of writing (the beacon chain’s Altair fork), the default penalty is actually very small: you will only lose 0.5 ETH, or about 1.5% of your staked 32 ETH (this will eventually increase to 1 ETH, or about 3%).

However, there is a catch: there is an additional penalty that depends on all other slashing events in the 4096 epochs (roughly 18 days) before and after your validator is slashed. You are additionally penalized in proportion to the total amount slashed during this period. This may be much larger than the initial penalty. Currently (the beacon chain’s Altair fork), it is set up so that if more than 50% of the total stake is slashed during this period (18 days before and after your slashing), you will lose your entire stake. Ultimately, it will be set up so that you lose your entire stake if 1/3 of the stake is slashed. It is set to 1/3 because this is the minimum amount of stake required to cause a consensus failure. As shown below:

[Figure: slashing penalty as a function of the share of stake slashed]

Above: the blue line represents the current penalty (after the beacon chain’s Altair upgrade); the red line represents the final penalty that will be set.
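The penalty rule described above can be approximated with a few lines of Python. This is a simplified sketch and not the protocol’s actual integer arithmetic: the function and parameter names are my own, the multipliers 2 and 3 are inferred from the 50% and 1/3 thresholds stated above, and the quotients 64 and 32 are inferred from the 0.5 ETH and 1 ETH initial penalties.

```python
def slashing_loss_eth(slashed_fraction, multiplier=2, min_penalty_quotient=64,
                      effective_balance=32.0):
    """Approximate total loss of one slashed validator, in ETH.

    slashed_fraction: share of the total stake slashed within the window.
    multiplier=2, min_penalty_quotient=64 model the Altair values;
    multiplier=3, min_penalty_quotient=32 model the final values.
    """
    # Initial penalty: 0.5 ETH (Altair) or 1 ETH (final) on 32 ETH.
    initial = effective_balance / min_penalty_quotient
    # Correlation penalty: proportional to how much stake was slashed.
    correlated = effective_balance * min(1.0, multiplier * slashed_fraction)
    return min(effective_balance, initial + correlated)

print(slashing_loss_eth(0.0))    # isolated slashing: initial penalty only
print(slashing_loss_eth(0.10))   # 10% of the stake slashed, Altair values
print(slashing_loss_eth(0.50))   # 50% slashed, Altair values: entire stake
print(slashing_loss_eth(1 / 3, multiplier=3, min_penalty_quotient=32))
```

With an isolated slashing the loss stays at the small initial penalty, while the correlation term quickly dominates once a meaningful share of the network is slashed at the same time.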

Another Anti-Correlation Penalty: Quadratic Inactivity Leak

Another way a validator can fail is by going offline. Validators are penalized for being offline too, but the mechanism is very different. We don’t call this penalty “slashing”, and under normal operation it is usually very mild: an offline validator is penalized by an amount comparable to the rewards it would have earned had it attested correctly during that time. At the time of writing, the annual rate of return for validators is 4.8%. If your validator is offline for a few hours or days (e.g. due to a temporary internet outage), there is probably nothing to be nervous about.

However, when more than 1/3 of the validators are offline, the situation becomes completely different. At this point, the beacon chain is unable to finalize blocks, which threatens a fundamental property of the consensus protocol: liveness. To restore the liveness of the beacon chain in such a scenario, the so-called “quadratic inactivity leak” mechanism is used. If a validator stays offline while the chain is not finalizing, its penalty increases quadratically over time. Initially this penalty is very low; after about 4.5 days, offline validators will have lost 1% of their stake. However, after about 10 days they will have lost 5%, and after about 21 days 20% (these are the values currently set in the beacon chain’s Altair fork, which will double in the future). As shown below:

[Figure: inactivity leak penalty over time]

This mechanism is designed to allow the blockchain to resume finalizing blocks as quickly as possible after a catastrophic event that takes a large number of validators offline. As offline validators lose more and more of their stake, they are gradually kicked out of the network and make up a smaller and smaller share of the total stake. Once their share falls below 1/3 of the total network stake, the remaining online validators regain the 2/3 majority required for consensus, allowing them to finalize the chain again.
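The quadratic growth of the leak can be illustrated with a rough model. This is a sketch under stated assumptions rather than the protocol’s exact accounting: it assumes the per-epoch penalty grows linearly with the number of epochs without finality, and the quotient `3 * 2**24` is an assumed Altair-era constant that I chose because it reproduces the figures quoted above. The function name is hypothetical.

```python
import math

EPOCHS_PER_DAY = 225  # 32 slots * 12 s = 384 s per epoch
# Assumed Altair-era constant; it reproduces the 1% / 5% / 20% figures above.
INACTIVITY_PENALTY_QUOTIENT = 3 * 2**24

def leak_loss_fraction(days_offline, quotient=INACTIVITY_PENALTY_QUOTIENT):
    """Approximate share of stake lost after `days_offline` days without finality."""
    epochs = days_offline * EPOCHS_PER_DAY
    # The per-epoch penalty grows linearly with time, so the total grows
    # quadratically (compounded here as an exponential decay of the balance).
    return 1.0 - math.exp(-epochs**2 / (2 * quotient))

for days in (4.5, 10, 21):
    print(f"{days:>4} days offline: {leak_loss_fraction(days):.1%} lost")
```

Halving the quotient in this model roughly doubles the early losses, which is one way to read the remark that the values will double in the future.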

However, there is a related situation: in some cases, validators can no longer vote on a valid chain because they have accidentally locked themselves into an invalid chain. We explain this further below.

How bad is it to run a dominant client?

To understand the danger, let’s look at three types of failure events:

  1. Massive slashing event: Due to a bug, validators running the dominant client (i.e. the client most validators choose to use) sign slashable messages.
  2. Massive offline event: Due to a bug, all validators running the dominant client go offline.
  3. Invalid block event: Due to a bug, all validators running the dominant client attest to an invalid block.

There are other types of massive failures and slashings that can happen, but I’ve limited myself to events related to client bugs (these are the ones you should consider when choosing which client to run).

Scenario 1: Double Signing

This is probably the scenario most validator operators worry about: a bug that causes the validator client to sign slashable messages. One example is two attestations that vote for the same target epoch but with different payloads (this is called “double signing”). Since this is a client bug, it is not just one staker who needs to worry, but every staker running that particular client. Once this behavior is discovered, the slashing will turn into a bloodbath: all affected stakers will lose 100% of their staked deposits. This is because we are considering stakers who run a dominant client (that is, the client most stakers choose to use). If instead the affected client carried only 10% of the total stake in the network (that is, it is not a dominant client), then “only” about 20% of the affected stake would be slashed (this is the slashing intensity as of the beacon chain’s Altair upgrade; when the final penalty parameters take effect, this ratio will increase to 30%).

The damage in this scenario is obviously extreme, but I also think it is highly unlikely. The conditions that make an attestation slashable are simple, which is why validator clients (VCs) are built to enforce them. The validator client is a small, well-audited piece of software, and a bug of this magnitude is unlikely.

We’ve seen some slashings so far, but as far as I can tell, they’ve all been due to operator error, almost all of them caused by operators running the same validator in multiple locations. As these were uncorrelated events, the penalties were small.

Scenario 2: Massive offline event

For this scenario, we assume that the dominant client has a bug that, when triggered, causes the client to crash. An offending block has been included in the chain, and whenever the dominant client encounters that block, it goes offline and can no longer take part in consensus. Since the dominant client is now offline, the inactivity leak kicks in.

Client developers will be scrambling to get things back to normal. In practice, within a few hours, or a few days at most, they will release a fix that removes the crash. In the meantime, stakers can also simply switch to another client. As soon as enough stakers have switched so that more than 2/3 of the validators are online again, the quadratic inactivity leak stops. It is not unlikely that this happens even before the buggy client is fixed.

This scenario is not unlikely (bugs that cause clients to crash are the most common type of bug), but the total penalty would probably be less than 1% of the affected stake.

Scenario 3: Invalid Block

For this scenario, we consider a situation where the dominant client has a bug that causes it to produce an invalid block and accept that block as valid. That is, when validators using the dominant client see this invalid block, they treat it as valid and attest to (vote for) it.

We call the chain containing this invalid block the A chain. Once the invalid block is produced, two things happen:

  1. All correctly functioning clients will ignore the invalid block and build a separate chain on top of the most recent valid block head, which we call the B chain. All correctly functioning clients will vote for the B chain and continue building it. As shown below.
  2. The buggy dominant client will see both chain A and chain B as valid chains. It will therefore vote for whichever chain it considers to be the “heaviest” at the moment.

[Figure: chain A (containing the invalid block) and chain B]

We need to distinguish three cases:

  1. The buggy dominant client carries less than 1/2 of the total stake. In this case, all the healthy clients will vote for the B chain and keep building it, eventually making the B chain the heaviest chain. At that point, even the buggy client will switch to the B chain. Nothing bad happens, except that one or a few orphaned blocks are produced. This is the happy case, and it is why it is important to have clients other than the dominant one in the network.

  2. The buggy dominant client carries more than 1/2 but less than 2/3 of the total stake. In this case, we’ll see two chains being built: chain A by the buggy client, and chain B by the other clients. Neither chain has a 2/3 majority of validators, so neither chain can finalize blocks. When this happens, developers will scramble to understand why there are two chains. Once they find the invalid block in chain A, they can go ahead and fix the buggy dominant client. Once the fix is in place, that client will treat the A chain as dead and start building on the B chain, so that the B chain can finalize. For users, such an event is very disruptive. While the time needed to figure out which chain is valid is expected to be short (maybe less than an hour), the chain is likely to be unable to finalize blocks for hours (or even a day). But for stakers, even those running the buggy dominant client, the penalty remains relatively small: they will incur inactivity leak penalties for building the invalid A chain instead of taking part in consensus on the B chain, but since this is likely to last less than a day, the penalty will probably be less than 1% of their stake.

  3. The buggy dominant client carries more than 2/3 of the stake. In this case, the buggy dominant client will not only build the A chain, it will actually have enough stake to “finalize” it. Note that this client is the only one that considers the A chain finalized, while all correctly functioning clients consider the A chain dead. However, because of how the Casper FFG protocol works, once a validator has finalized chain A, it can no longer take part in another chain that conflicts with chain A without being slashed, unless that other chain is finalized. So once chain A has been finalized, validators running the buggy client are in a dire dilemma: they have voted for chain A, but chain A is dead; and they cannot build on chain B either, because chain B cannot be finalized yet. Even a bug fix to their client won’t help them, because they have already sent the offending votes. What happens next is very painful: chain B, which cannot finalize, will start the inactivity leak, and over the next few weeks the validators on chain A will lose their stake until enough of it has been destroyed for chain B to resume finalizing. Assuming the stakers on chain A start with 70% of the network’s stake, they will lose at least 79% of their stake, because that is how much they must lose for their share to fall below 1/3 of the network’s stake. At that point, chain B will resume finalizing, and all the remaining stake can switch to chain B. The blockchain will return to a healthy state, but only after weeks of disruption, with millions of ETH destroyed in the process.

Obviously, the third case above is a disaster. This is why we very much want to avoid any one client holding more than 2/3 of the total stake. That way an invalid block can never be finalized and this disaster can never happen.
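The 79% figure in the third case follows from simple arithmetic, which can be checked with a short sketch. The function name is my own, and the model assumes that the validators on chain B lose nothing while the validators on chain A leak. If the buggy client starts with a share s and loses a fraction x of its stake, its remaining share is s(1 - x) / (s(1 - x) + (1 - s)); setting this equal to 1/3 gives x = 1 - (1 - s) / (2s).

```python
from fractions import Fraction

def required_loss(buggy_share):
    """Fraction of their stake that chain-A validators must lose before
    they hold no more than 1/3 of the total, so that chain B can finalize.
    Assumes the validators on chain B keep their full stake."""
    s = Fraction(buggy_share)
    if s <= Fraction(1, 3):
        return Fraction(0)  # already at or below 1/3, nothing needs to leak
    return 1 - (1 - s) / (2 * s)

loss = required_loss(Fraction(7, 10))
print(loss, f"= {float(loss):.1%}")  # 11/14, i.e. roughly 79%
```

The larger the buggy client’s initial share, the more of its stake has to be destroyed before the healthy chain can finalize again.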

Risk Analysis

So how should we assess these situations? A typical risk analysis strategy is to assess the likelihood of an event occurring (we use the number 1 for extremely unlikely, and the number 5 for quite likely) and impact (number 1 for very low and number 5 for catastrophic). The most important risks are those that score high on these two metrics, represented by the product of impact and likelihood.

[Table: likelihood and impact ratings for the three scenarios]

Based on the table above, the riskiest by far is Scenario 3. The outcome is disastrous when one client holds a supermajority of more than 2/3 of the stake, and it is also a relatively likely scenario. To underscore how easily such bugs can occur, a similar bug appeared recently on the Kiln testnet: in that case, the Prysm client proposed a block, then discovered that the block was faulty and did not attest to it. Had Prysm considered the block valid, and had this happened on Ethereum mainnet rather than on a testnet, we would have ended up in the third case of Scenario 3, because Prysm currently holds a 2/3 supermajority on Ethereum mainnet. Therefore, if you are currently running the Prysm client, there is a very real risk that you could lose all your funds, and you should consider switching clients.

While Scenario 1 above is probably the most feared, it has a relatively low risk rating. The reason is that I think the probability of Scenario 1 happening is very low, because I believe the validator client software is well implemented in all clients and is unlikely to produce slashable attestations or blocks.

If I’m currently running the dominant client and am concerned about switching clients, what are my options?

Switching clients can be a major undertaking. It also comes with some risks. What if the slashing database is not migrated properly to the new setup? There may be a risk of getting slashed, which completely defeats the purpose of switching clients.

For those concerned about this, I suggest another option: you can leave your validator setup exactly as it is (no need to move keys, etc.) and switch only the beacon node. This is extremely low risk, because as long as the validator client works as intended, it will not double-sign and therefore cannot be slashed. This may be a particularly good option if you run a large validator operation where changes to the validator client (or remote signer) infrastructure are very expensive and likely to require an audit. If the new setup doesn’t perform as well as expected, it’s easy to switch back to the old client or to try another non-dominant client.

The advantage is that you have little to worry about when switching beacon nodes: the worst that can happen is that you go offline temporarily. This is because a beacon node can never produce a slashable message by itself. And if you’re running a non-dominant client, you cannot end up in the catastrophic case of Scenario 3 above, because even if you vote for an invalid block, that block won’t get enough votes to be finalized.

What about the execution client?

What I’ve written above applies to the consensus clients, which include Prysm, Lighthouse, Nimbus, Lodestar and Teku. As of this writing, Prysm probably has a 2/3 supermajority on the network.

All of this applies to execution clients in the same way. If Go-ethereum (which is likely to become the dominant execution client after the merge) produces an invalid block, that block may be finalized, resulting in the catastrophic third case of Scenario 3 described above.

Fortunately, we now have three other production-ready execution clients: Nethermind, Besu and Erigon. If you are a staker, I highly recommend running one of these execution clients. If you’re running a non-dominant client, the risk is pretty low! But if you’re running a dominant client, you run a serious risk of losing all your money!

Posted by:CoinYuppie,Reprinted with attribution to: