For security and liveness reasons, Ethereum chose a multi-client architecture. To encourage stakers to diversify their setups, the penalty for correlated failures is higher: a staker running a minority client will typically only lose a modest amount in the event of a client bug, while a staker running a majority client can lose their entire stake. Responsible stakers should therefore look at the client landscape and choose a less popular client.
Why do we need multiple clients?
There are arguments that a single-client architecture would be preferable. Developing multiple clients also incurs considerable overhead, which is why no other blockchain network is seriously pursuing the multi-client option.
So why is Ethereum aiming to be multi-client? Clients are very complex pieces of code that are likely to contain bugs. The worst of these are so-called "consensus bugs": errors in the blockchain's core state transition logic. An oft-cited example is the "infinite money supply" vulnerability, in which a vulnerable client accepts a transaction that prints an arbitrary amount of ether. If someone found such a loophole and wasn't stopped before reaching the exits (i.e., moving the funds through a mixer or exchange), it would cause a massive collapse in the value of ether.
If everyone were running the same client, stopping the chain would require human intervention: the chain, all smart contracts, and exchanges would continue to operate as usual. Even a few minutes is enough to execute a successful attack and spread the funds widely enough that simply rolling back the attacker's transaction becomes impossible. Depending on the amount of ETH printed, the community might coordinate to roll the chain back to a point before the exploit (once the bug is identified and fixed).
Now, let’s see what happens when we have multiple clients. There are two possible scenarios:
1. The vulnerable client has less than 50% of the stake. The buggy client will use the transaction exploiting the bug to produce a block that prints ETH; let's call the resulting chain A.
However, the majority of stakers, running bug-free clients, will ignore this block because it is invalid (to them, the ETH-printing operation is invalid). They will build an alternative chain B that does not contain the invalid block.
Since correct clients hold the majority of the stake, chain B will accumulate more attestations. Eventually even the buggy client will vote for chain B; the result is that chain B accumulates 100% of the votes and chain A dies. The chain continues as if the bug never happened.
2. The vulnerable client has more than 50% of the stake. In this case chain A will accumulate the majority of attestations. Since chain A always looks heavier to it than chain B, the buggy client will never see a reason to switch from chain A to chain B. Hence, we will see a chain split.
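The two cases above can be illustrated with a toy model (a deliberate simplification with hypothetical names; the real fork choice is LMD-GHOST over a block tree, not a two-chain comparison):

```python
# Toy model: chain A contains the invalid block, chain B does not.
# Correct clients only ever consider B valid; the buggy client
# considers both valid and follows the heavier chain.

def final_weights(buggy_stake: int, correct_stake: int) -> dict:
    """Stakes given in percent of the total."""
    weights = {"A": buggy_stake, "B": correct_stake}  # first round of votes
    # the buggy client re-evaluates and follows the heavier chain;
    # correct clients keep voting for B regardless
    buggy_choice = "A" if weights["A"] > weights["B"] else "B"
    weights = {"A": 0, "B": correct_stake}
    weights[buggy_choice] += buggy_stake
    return weights
```

With a 30% buggy minority, `final_weights(30, 70)` gives chain B all of the stake and chain A dies; with a 60% buggy majority, the split persists at 60/40 and neither chain can finalize.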
Case 1 is fairly benign: it will most likely only result in an orphaned block that most users won't even notice. The developers can then debug the client, fix the bug, and all is well. Case 2 is clearly less ideal, but still better than having just one client: most people will quickly detect the chain split (you can detect it automatically by running multiple clients), exchanges will quickly suspend deposits, and DeFi users can proceed with caution until the split is resolved. Compared to a single-client architecture, this still gives us a flashing red warning light and shields us from the worst possible outcome.
Case 2 would be worse still if the buggy client were run by more than 2/3 of the stake. In that case it would finalize the invalid chain. More on this later.
Some argue that a chain split is so catastrophic that it is, by itself, an argument for a single-client architecture. Note, however, that chain splits only happen because of a bug in a client. With a single client, if you wanted to fix the bug and restore the chain's intended state, you would have to roll back to the block before the error occurred, which is just as bad as a chain split! So while a chain split sounds bad, in the case of a critical client bug it is actually a feature, not a bug. At least you can see that something is seriously wrong.
Incentivizing Client Diversity: The Anti-Correlation Penalty
If staking is spread across multiple clients such that each client has less than one-third of the total stake, the network clearly benefits: it becomes resilient to a bug in any single client. But why should individual stakers care? Without an incentive, they are unlikely to incur the cost of switching to a minority client.
Unfortunately, we cannot make rewards depend directly on which client a validator is running. There is no objective way to measure this that cannot be gamed.
However, you cannot hide it when your client has a bug. This is where the anti-correlation penalty comes in: the idea is that if your validator does something bad, the penalty is higher when more validators misbehave at about the same time. In other words, you are penalized more heavily for correlated failures.
On the beacon chain, you can currently be slashed for two actions:
1. Signing two different blocks at the same height (as a proposer).
2. Creating a pair of slashable attestations (surround votes or double votes).
When you are slashed, you usually don't lose all your funds. As of this writing (the Altair fork), the initial penalty is actually pretty small: you lose only 0.5 ETH, or about 1.5% of your 32 ETH stake (this will eventually be increased to 1 ETH, or about 3%).
However, there's a catch: there is an additional penalty that depends on all the other slashings that happen within 4,096 epochs (18 days) before and after your validator is slashed. The additional amount you are penalized is proportional to the total amount of stake slashed during that window.
This can be a much larger penalty than the initial one. Currently (the Altair fork), it is set up so that if more than half of the total stake is slashed during this window, you lose all your funds. Ultimately, it will be set so that if 1/3 of the other validators are slashed, you lose your entire stake. The value 1/3 was chosen because it is the minimum amount of stake that must equivocate to cause a consensus failure.
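As a rough sketch of the numbers above (a simplification of the spec's `process_slashings`, which works in gwei increments on effective balances; the constant names match the Altair spec, the function names are mine):

```python
MIN_SLASHING_PENALTY_QUOTIENT_ALTAIR = 64    # initial penalty: 32 / 64 = 0.5 ETH
PROPORTIONAL_SLASHING_MULTIPLIER_ALTAIR = 2  # will eventually be raised to 3

def initial_penalty(effective_balance_eth):
    """The small fixed penalty applied immediately when slashed."""
    return effective_balance_eth / MIN_SLASHING_PENALTY_QUOTIENT_ALTAIR

def correlation_penalty(effective_balance_eth, total_slashed_eth, total_stake_eth,
                        multiplier=PROPORTIONAL_SLASHING_MULTIPLIER_ALTAIR):
    """Extra penalty proportional to all stake slashed in the 36-day window,
    capped at the validator's whole balance."""
    fraction = min(multiplier * total_slashed_eth, total_stake_eth) / total_stake_eth
    return effective_balance_eth * fraction
```

With 10% of all stake slashed, `correlation_penalty(32, 1_000_000, 10_000_000)` gives 6.4 ETH, i.e. 20% of the stake; once half of all stake is slashed, the entire 32 ETH is lost.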
A Second Anti-Correlation Penalty: The Quadratic Inactivity Leak
Another way a validator can fail is by going offline. This is also penalized, but the mechanism is very different: we don't call it slashing, and the penalty is usually small. Under normal operation, offline validators are penalized at the same rate as the rewards they would earn if they were validating correctly; at the time of writing, that is about 4.8% annually. If your validator is offline for a few hours or days, for example due to a temporary internet outage, it is not worth breaking a sweat over.
The situation becomes very different when more than 1/3 of validators are offline. Then the beacon chain can no longer finalize, threatening a fundamental property of the consensus protocol, liveness.
To recover in such situations, the so-called "quadratic inactivity leak" comes into play. While the chain is not finalizing, the total penalty for validators that remain offline grows quadratically over time. It is very low initially: offline validators lose about 1% of their stake after roughly 4.5 days. However, it rises to 5% after ~10 days and 20% after ~21 days (these are the Altair values; they will roughly double in the future).
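The quadratic growth can be sketched with a back-of-the-envelope model (the actual spec tracks per-validator inactivity scores; this simplification just integrates a per-epoch penalty that grows linearly with the time since finality):

```python
EPOCHS_PER_DAY = 225  # 32 slots of 12 s per epoch -> 225 epochs per day
INACTIVITY_PENALTY_QUOTIENT_ALTAIR = 3 * 2**24  # will later halve, doubling the leak

def leaked_fraction(days_offline):
    """Approximate cumulative loss for a validator that stays offline
    while the chain is not finalizing: ~ t^2 / (2 * Q)."""
    t = days_offline * EPOCHS_PER_DAY
    return t * t / (2 * INACTIVITY_PENALTY_QUOTIENT_ALTAIR)
```

This reproduces the figures above: roughly 1% after 4.5 days, 5% after 10 days, and just over 20% after 21 days offline.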
The mechanism is designed so that in the event of a catastrophic failure that knocks out a large number of validators, the chain will eventually be able to finalize again. As offline validators lose more and more of their stake, their share of the total stake shrinks, and once it falls below 1/3, the remaining online validators regain the required 2/3 majority and can finalize the chain.
However, there is another, related situation: in some cases validators can no longer vote for the valid chain because they have accidentally locked themselves into an invalid one. More on this below.
How bad is it to run a majority client?
To understand the danger, let's look at three types of failure:
1. Mass slashing event: the majority client's validators sign slashable attestations due to a bug.
2. Mass offline event: all of the majority client's validators go offline due to a bug.
3. Invalid block event: the majority client's validators attest to an invalid block due to a bug.
Other types of mass failures and hacks can also occur, but I've limited the discussion to failures caused by client bugs (which is what should be considered when choosing which client to run).
Scenario 1: Double Signing
This is probably the most feared scenario among validator operators: a bug that causes the validator client to sign a slashable attestation. An example would be two attestations voting for the same target epoch but with different payloads. Because this is a client bug, it affects not just one staker but all stakers running that particular client. If the buggy client holds a majority of the stake, the discovery of these equivocations would be a bloodbath: thanks to the anti-correlation penalty, all the stakers involved could lose 100% of their staked funds. This is where having multiple clients helps: if the affected client holds only 10% of the stake, "only" 20% of those stakers' balances would be slashed (under Altair; 30% with the final penalty parameters).
The damage in this scenario is obviously extreme, but I also consider it highly unlikely. The conditions for slashable attestations are simple, and validator clients (VCs) are built to enforce them. The validator client is a small, well-audited piece of software, and a bug of this magnitude is unlikely.
We have seen some slashings so far, but as far as I can tell, all of them were due to operator failure; almost all were caused by operators running the same validator in several locations. Since none of these failures were correlated, the slashed amounts were small.
Scenario 2: Massive offline event
For this scenario, we assume the majority client has a bug that, when triggered, causes the client to crash. The triggering block has already been integrated into the chain, so whenever the client encounters it, it goes offline and can no longer participate in consensus. The majority of validators are now offline, and the inactivity leak begins.
The client's developers will scramble to put everything back together. Realistically, within a few hours, days at most, they will release a bug fix that eliminates the crash.
In the meantime, stakers may also choose to simply switch to another client. As soon as enough of them do so to bring more than 2/3 of validators back online, the quadratic inactivity leak stops, even before the buggy client is fixed.
This scenario is quite possible (crash bugs are among the most common kinds), but the total penalty would likely amount to less than 1% of the affected stake.
Scenario 3: Invalid Block
For this scenario, we consider the case where the majority client has a bug that produces an invalid block and also accepts it as valid: when other validators running the same client see the invalid block, they consider it valid and attest to it.
Let's call the chain containing the invalid block chain A. Once the invalid block is produced, two things happen:
1. All correctly working clients ignore the invalid block and instead build on the latest valid head, creating a separate chain B. All working clients vote for and build on chain B.
2. The buggy client considers both chains A and B valid, so it votes for whichever of the two it currently sees as the heaviest.
We need to distinguish three cases:
1. The buggy client holds less than half of the total stake. In this case, all correct clients vote for and build on chain B, which quickly makes it the heaviest chain. At that point, even the buggy client switches to chain B. Nothing bad happens beyond one or a few orphaned blocks. This is reassuring, and it is why being on a minority client is great.
2. The buggy client holds more than half but less than two-thirds of the stake. In this case, we see two chains being built: A by the buggy client, and B by all other clients. Neither chain has a two-thirds majority, so neither can be finalized. When this happens, developers will scramble to understand why there are two chains. Once they discover the invalid block in chain A, they can fix the buggy client. The fixed client will recognize chain A as invalid, start building on chain B, and allow it to finalize. This is very disruptive for users: while the confusion over which chain is valid will hopefully be brief, less than an hour, the chain may not finalize for several hours or even a day. For stakers, though, even those running the buggy client, the penalties remain relatively mild. While they were building the invalid chain A and not participating in chain B, they suffer the inactivity leak; but since this lasts less than a day or so, we are talking about a penalty of well under 1% of their stake.
3. The buggy client holds more than two-thirds of the stake. In this case, the buggy client will not just build chain A; it actually has enough stake to finalize it. Note that it will be the only client that considers chain A finalized: one of the conditions for finalization is that the chain is valid, and to all correctly operating clients, chain A is invalid. However, due to the way the Casper FFG protocol works, once validators have finalized chain A, they can never participate in a conflicting chain without being slashed, unless that chain is itself finalized (see Addendum 2 for the details). So once chain A is finalized, the validators running the buggy client are in a dire dilemma: they have committed to chain A, but chain A is invalid. They cannot contribute to chain B, because it has not been finalized. Even a bug fix to their validator software won't help them: they have already sent the offending votes. What happens now is very painful: chain B, which cannot finalize, goes into a quadratic inactivity leak. Over the course of a few weeks, the stuck validators leak their stake until they have lost enough for chain B to finalize again. Say they start with 70% of the stake: they will then lose about 79% of it, because that is how much they need to lose before they represent less than a third of the total remaining stake. At that point chain B finalizes again and all stakers can switch to it. The chain will be healthy again, but the outage will have lasted weeks, with millions of ETH destroyed in the process.
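The 79% figure can be checked with a small sketch (a simplification that treats the honest validators' balances as constant during the leak):

```python
def required_leak_fraction(buggy_share):
    """Fraction of the stuck validators' stake that must leak away
    before the remaining (honest) validators hold a 2/3 supermajority."""
    honest = 1.0 - buggy_share
    # need: honest >= 2/3 * (honest + remaining)  =>  remaining <= honest / 2
    max_remaining = honest / 2
    return 1.0 - max_remaining / buggy_share
```

`required_leak_fraction(0.7)` is about 0.786, i.e. the roughly 79% quoted above; a 75% majority would have to leak about 83% of its stake.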
Clearly, case 3 is nothing short of a disaster. That is why we are so keen that no client hold more than two-thirds of the stake: then no invalid block can ever be finalized, and this scenario can never happen.
So how do we evaluate these scenarios? A typical risk-analysis strategy is to score each event by likelihood (1 = extremely unlikely, 5 = very likely) and impact (1 = very low, 5 = catastrophic). The most important risks to focus on are those that score high on both metrics, as measured by the product of impact and likelihood.
With this in mind, Scenario 3 is by far the highest priority. When one client holds a two-thirds supermajority, the impact is outright catastrophic, and it is also a relatively likely scenario. To underscore how easily such a bug can happen: one occurred recently on the Kiln testnet (a failed block proposal). In that case, Prysm did detect that the block was defective after proposing it, and did not attest to it. Had Prysm considered the block valid, and had this happened on mainnet, we would be in the catastrophic situation described in Scenario 3, since Prysm currently has a roughly 2/3 majority on mainnet. So if you are currently running Prysm, losing all your funds is a very real risk, and you should consider switching clients.
Scenario 1, probably the most feared, actually gets a relatively low score. The reason is that I think the probability of it happening is fairly low: the validator client software is well implemented across all clients, and it is unlikely to produce slashable attestations or blocks.
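To make the ranking concrete, here is a small sketch. The likelihood/impact scores are purely illustrative, my reading of the qualitative judgments above, not figures from any source:

```python
# Hypothetical 1-5 scores reflecting the discussion above: scenario 1 is
# low-likelihood / high-impact, scenario 2 is likely but mild, and
# scenario 3 (with a >2/3 client) is plausible and catastrophic.
scenarios = {
    "1: mass double-signing":     {"likelihood": 1, "impact": 5},
    "2: mass offline event":      {"likelihood": 4, "impact": 2},
    "3: invalid block finalized": {"likelihood": 3, "impact": 5},
}

# rank by the product of likelihood and impact, highest risk first
ranked = sorted(scenarios,
                key=lambda s: scenarios[s]["likelihood"] * scenarios[s]["impact"],
                reverse=True)
```

Under these illustrative scores, Scenario 3 tops the list (score 15) and Scenario 1 comes last (score 5), matching the assessment above.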
If I'm currently running a majority client and I'm worried about switching, what are my options?
Replacing a client can be a major undertaking, and it comes with risks of its own. What if the slashing protection database is not migrated correctly to the new setup? Then there's a risk of being slashed, which totally defeats the purpose.
For anyone worried about this, I would suggest another option: leave your validator setup as it is (no need to move keys, etc.) and only switch the beacon node. This is very low risk, because as long as the validator client works as expected, it will never double-sign and thus cannot be slashed. Especially for large operations, where changing the validator client (or remote signer) infrastructure would be very expensive and might require an audit, this can be a good option. It is also easy to switch back to the original client, or to try a handful of other clients, if the new setup doesn't perform as expected.
The good news is that there is little to worry about when switching beacon nodes: the worst that can happen is going offline temporarily. This is because the beacon node by itself can never generate slashable messages. And if you're running a minority client, you cannot end up in Scenario 3: even if you vote for an invalid block, that block will never get enough votes to be finalized.
What about execution clients?
What I've written above applies to consensus clients: Prysm, Lighthouse, Nimbus, Lodestar, and Teku. Prysm probably has a two-thirds majority on mainnet at the time of writing.
All of it applies in the same way to execution clients. Go-Ethereum (Geth) is likely to be the majority execution client after the Merge, and if it produces an invalid block, that block may be finalized, causing the catastrophic failure described in Scenario 3.
Fortunately, we now have three other execution clients ready for production: Nethermind, Besu, and Erigon. If you are a staker, I highly recommend running one of these. If you run a minority client, your risk is very low; but if you run a majority client, you run a serious risk of losing all your funds.
Addendum 1: Why not slash invalid blocks?
In Scenario 3, we had to rely on the quadratic inactivity leak to penalize validators that propose and vote for invalid blocks. That seems odd: why don't we just slash them directly? It looks faster and less painful.
In fact, there are two reasons we don't. One is that we can't do it at the moment; but even if we could, we probably wouldn't want to:
1. Currently, it is almost impossible to introduce a slashing penalty for invalid blocks. This is because neither the beacon chain nor the execution chain is currently "stateless": in order to check whether a block is valid, you need context (the "state"), which is hundreds of MB for the beacon chain and on the order of GB for the execution chain. This means there is no concise proof that a block is invalid, and we need such a proof in order to slash a validator: the block that slashes a validator has to include evidence that the validator broke the rules. There are ways around this in the absence of stateless consensus, but they involve more complex constructions, such as the multi-round fraud proofs that Arbitrum currently uses for its rollup.
2. The second reason we might not be eager to introduce this type of slashing is that, even if we could, producing an invalid block is much harder to guard against than the current slashing conditions. The current conditions are so simple that the validator client can check them with just a few lines of code. That is why I consider Scenario 1 above unlikely: so far, slashable messages have only resulted from operator blunders, and I expect that to continue. Adding a slashing condition for producing (or attesting to) invalid blocks would increase the risk for stakers. Then even those running minority clients could face serious penalties.
All in all, it is unlikely that we will see direct penalties for invalid blocks and/or attestations to them in the next few years.
Addendum 2: Why can't the buggy client switch to chain B after finalizing chain A?
This section is for those who want to understand in more detail why the buggy client cannot simply switch back, and must instead suffer the dreaded inactivity leak. For that, we have to look at how Casper FFG finality works.
Each attestation contains a source checkpoint and a target checkpoint. A checkpoint is the first block of an epoch. If there is a link from one epoch to another for which the total stake voting for that link exceeds 2/3 of all stake (i.e., that many attestations have the first checkpoint as their source and the second as their target), we call it a "supermajority link".
An epoch can be "justified" or "finalized", defined as follows:
1. Epoch 0 is justified.
2. An epoch is justified if there is a supermajority link to it from a justified epoch.
3. Epoch X is finalized if (1) epoch X is justified, and (2) the next epoch is also justified via a supermajority link whose source is epoch X.
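The three rules can be sketched in a few lines (a simplification; the real spec tracks justification bits over a four-epoch window in `process_justification_and_finalization`, and the function names here are mine):

```python
def justified_epochs(supermajority_links):
    """Compute the set of justified epochs from (source, target) links."""
    justified = {0}                      # rule 1: epoch 0 is justified
    changed = True
    while changed:                       # rule 2: propagate along links
        changed = False
        for src, tgt in supermajority_links:
            if src in justified and tgt not in justified:
                justified.add(tgt)
                changed = True
    return justified

def is_finalized(epoch, supermajority_links):
    """Rule 3 (simplified): a supermajority link from an epoch to the
    next epoch, with both justified, finalizes the earlier epoch."""
    justified = justified_epochs(supermajority_links)
    return (epoch in justified
            and epoch + 1 in justified
            and (epoch, epoch + 1) in supermajority_links)
```

With links `{(0, 1), (1, 2)}`, epochs 0 through 2 are justified and epochs 0 and 1 are finalized, but epoch 2 is not, since nothing links beyond it.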
Rule 3 is slightly simplified (there are more ways to finalize an epoch, but they don't matter for this discussion). Now let's look at the slashing conditions. There are two slashing rules, both comparing a pair of attestations V and W:
1. If V and W have the same target epoch (i.e., the same height) but do not vote for the same checkpoint, they are slashable (double voting).
2. If (1) V's source is earlier than W's source and (2) V's target is later than W's target, they are slashable (surround voting: V "surrounds" W).
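These two rules can be written out directly (a simplification of the spec's `is_slashable_attestation_data`; the field and class names here are illustrative):

```python
from dataclasses import dataclass

@dataclass
class Attestation:
    source_epoch: int
    target_epoch: int
    target_root: str   # checkpoint voted for at the target height

def is_slashable(v: Attestation, w: Attestation) -> bool:
    # rule 1: double vote - same target epoch, different checkpoint
    double_vote = (v.target_epoch == w.target_epoch
                   and v.target_root != w.target_root)
    # rule 2: surround vote - v surrounds w
    surround = (v.source_epoch < w.source_epoch
                and w.target_epoch < v.target_epoch)
    return double_vote or surround
```

For example, an attestation with source epoch 3 and target epoch 7 surrounds one with source 4 and target 6, so signing both is slashable.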
The first condition is obvious: it prevents simply voting for two different chains at the same height. But what is the second condition for?
Its function is to slash all validators involved in finalizing two conflicting chains (which should never happen). To see why, let's look again at Scenario 3, the worst case, in which the buggy client holds a supermajority (>2/3 of the stake). As it keeps voting for the faulty chain, it finalizes an epoch containing the invalid block, as follows:
The rounded squares in this image represent epochs, not blocks. The green arrows are the last supermajority links created by all validators. The red arrows are supermajority links supported only by the buggy clients; working clients ignore the epochs containing invalid blocks (red). The first red arrow justifies the invalid epoch, and the second finalizes it.
Now let's assume the bug has been fixed, and the validators that finalized the invalid epoch want to rejoin the correct chain B. For that chain to finalize, the first step is to justify epoch X:
However, in order to participate in justifying epoch X (which requires the supermajority link indicated by the dashed green arrow), they would have to "jump over" the second red arrow, the one that finalized the invalid epoch. Voting for both of these links is a slashable surround vote.
This remains true for any later epoch as well. The only way out is the quadratic inactivity leak: as chain B grows, the locked-in validators leak their funds until chain B can be justified and finalized by the working clients alone.
Posted by: CoinYuppie. Reprinted with attribution to: https://coinyuppie.com/ethereum-merge-run-multiple-clients-at-your-own-risk/