Why is “data availability” critical to blockchain scaling?

As you may have heard, Ethereum’s sharding roadmap has largely dropped execution sharding and now focuses on data sharding, with the goal of maximizing the throughput of Ethereum’s data space.

You may also have followed the recent discussion on modular blockchains, dug into Rollup, learned about volitions and validiums, and heard about “data availability solutions”.

Along the way, however, you may have been left with one question: “What exactly is data availability?”

Before answering that, let’s quickly review how most blockchains work.

Transactions, Nodes, and the Infamous Blockchain Trilemma

When you come across a new OHM fork with an astonishingly high annual yield, you will no doubt smash the “stake” button without hesitation. But what actually happens when you submit that transaction in MetaMask?

Why is "data availability" critical to blockchain scaling?

In simple terms, your transaction first goes into the mempool. Assuming your bribe to the miner or validator is high enough, it gets included in the next block and appended to the blockchain for everyone to check later. The block containing your transaction is then broadcast to the blockchain’s network of nodes. Full nodes download this new block and re-execute every transaction it contains (including yours, of course), verifying along the way that each transaction is valid. In your case, for example, full nodes will check that you have not stolen someone else’s funds and that you have enough ETH to pay the gas fee. The essential job of a full node, then, is to enforce the rules of the blockchain on miners and validators.
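
To make this concrete, here is a minimal sketch in Python of the bookkeeping a full node does when it re-executes a block. The transaction fields and the `validate_block` helper are illustrative simplifications, not any real client’s API.

```python
# A toy full node re-executing a block: every rule check is done locally.
from dataclasses import dataclass

@dataclass
class Tx:
    sender: str
    recipient: str
    amount: int
    fee: int
    nonce: int

def validate_block(transactions: list[Tx], balances: dict[str, int], nonces: dict[str, int]) -> bool:
    """Re-execute every transaction and reject the block if any rule is broken."""
    for tx in transactions:
        # The sender must actually own enough funds to cover the transfer plus the fee.
        if balances.get(tx.sender, 0) < tx.amount + tx.fee:
            return False
        # The nonce must match, so transactions cannot be replayed or reordered.
        if nonces.get(tx.sender, 0) != tx.nonce:
            return False
        # Apply the state transition locally.
        balances[tx.sender] -= tx.amount + tx.fee
        balances[tx.recipient] = balances.get(tx.recipient, 0) + tx.amount
        nonces[tx.sender] = tx.nonce + 1
    return True

# Example: a block containing one valid transfer.
balances = {"alice": 100, "bob": 0}
nonces = {"alice": 0}
block = [Tx("alice", "bob", 60, 1, 0)]
print(validate_block(block, balances, nonces))  # True
```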

It is precisely this mechanism that gives traditional blockchains their scaling problem. Because full nodes check every transaction to verify that it follows the rules, a blockchain cannot process more transactions per second without raising hardware requirements (better hardware means more powerful full nodes that can verify more transactions, which allows bigger blocks packed with more transactions). But if the hardware requirements for running a full node go up, the number of full nodes goes down and decentralization suffers: with fewer people able to ensure that miners and validators follow the rules, the situation becomes dangerous, because the number of trust assumptions grows.

Why is "data availability" critical to blockchain scaling?

Data availability is one of the main reasons why we cannot have scalability, security, and decentralization all at once

This mechanism also shows why guaranteed data availability matters in a traditional monolithic blockchain: block producers (miners/validators) must publish and make available the transaction data for the blocks they produce, so that full nodes can check their work. If block producers do not make this data available, full nodes cannot check their work, and there is no way to ensure they are following the rules of the blockchain.

Now that you understand why data availability matters in traditional monolithic blockchains, let’s look at the role it plays in everyone’s favorite scaling solution: Rollup.

How important is data availability in the context of Rollup?

Let’s start by revisiting how Rollup solves the scalability problem: instead of raising the hardware requirements for running a full node, why not reduce the number of transactions a full node has to verify? We can offload the execution and computation of transactions from full nodes to a single more powerful computer, known as the sequencer.

But doesn’t that mean we now have to trust the sequencer? If the hardware requirements of full nodes are to stay low, they will necessarily be slower than the sequencer at checking its work. So how do we ensure that the new blocks proposed by the sequencer are valid (that is, that the sequencer isn’t stealing everyone’s funds)? Since this question has been asked many times, you may already know the answer, but please bear with the explanation below (and if you don’t remember, the articles by Benjamin Simon and Vitalik are worth revisiting):

For Optimistic Rollup, we rely on fraud proofs to keep the sequencer honest: unless someone submits a fraud proof showing that a batch contains an invalid or malicious transaction, we assume by default that the sequencer is behaving correctly. But for anyone to be able to compute a fraud proof, they need the transaction data that the sequencer executed. In other words, the sequencer must publish its transaction data, otherwise no one can hold an Optimistic Rollup sequencer accountable.
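
As a rough illustration of why the transaction data matters here, the toy sketch below re-executes a published batch and compares the result against the sequencer’s claimed state root. The dictionary state, the tuple-based batch format, and the `fraud_proof` helper are assumptions made for the example; real fraud proofs dispute a single execution step inside a Merkleized state.

```python
import hashlib, json

def state_root(state: dict) -> str:
    # Toy "state root": a hash of the sorted state. Real rollups use Merkle/Verkle trees.
    return hashlib.sha256(json.dumps(state, sort_keys=True).encode()).hexdigest()

def execute_batch(state: dict, batch: list) -> dict:
    # Re-run every transfer in the batch against a copy of the previous state.
    new_state = dict(state)
    for sender, recipient, amount in batch:
        if new_state.get(sender, 0) >= amount:
            new_state[sender] -= amount
            new_state[recipient] = new_state.get(recipient, 0) + amount
    return new_state

def fraud_proof(prev_state: dict, published_batch: list, claimed_root: str) -> bool:
    """True if the sequencer's claimed post-state root is wrong.
    This only works if published_batch (the transaction data) is actually available."""
    return state_root(execute_batch(prev_state, published_batch)) != claimed_root

prev = {"alice": 100, "bob": 0}
batch = [("alice", "bob", 40)]
honest_root = state_root(execute_batch(prev, batch))
dishonest_root = state_root({"alice": 0, "bob": 0, "sequencer": 100})  # sequencer tries to steal
print(fraud_proof(prev, batch, honest_root))     # False: nothing to challenge
print(fraud_proof(prev, batch, dishonest_root))  # True: the challenge succeeds
```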

Why is "data availability" critical to blockchain scaling?

With ZK Rollup, keeping the sequencer honest is much simpler: when executing a batch of transactions, the sequencer must submit a validity proof (a ZK-SNARK or ZK-STARK), which guarantees that no invalid or malicious transaction appears in the batch. Moreover, anyone (even a smart contract) can easily verify these proofs. Yet data availability still matters enormously for a ZK Rollup’s sequencer: as Rollup users who want to ape into the next shitcoin quickly, we need to know our account balances on the Rollup. If the transaction data is not available, we cannot work out our balances and can no longer interact with the Rollup.
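
The sketch below shows only the shape of this interface: `toy_proof` and `toy_verify` stand in for a real SNARK/STARK prover and verifier, and the `ZkRollupBridge` class is an invented illustration of the point that even a valid proof does not remove the need to publish the transaction data.

```python
import hashlib

def toy_proof(old_root: str, new_root: str) -> bytes:
    # Stand-in for a SNARK/STARK prover; a real proof attests to correct execution.
    return hashlib.sha256((old_root + "->" + new_root).encode()).digest()

def toy_verify(proof: bytes, old_root: str, new_root: str) -> bool:
    # Stand-in for the on-chain verifier contract.
    return proof == toy_proof(old_root, new_root)

class ZkRollupBridge:
    """Toy bridge contract: accepts a batch only with a valid proof, and still
    insists the transaction data is published so users can track their balances."""
    def __init__(self, genesis_root: str):
        self.state_root = genesis_root
        self.published_data: list[bytes] = []

    def submit_batch(self, new_root: str, proof: bytes, tx_data: bytes) -> None:
        if not toy_verify(proof, self.state_root, new_root):
            raise ValueError("invalid validity proof")
        # Even with a valid proof, the raw data must be made available,
        # otherwise users cannot compute their own balances.
        self.published_data.append(tx_data)
        self.state_root = new_root

bridge = ZkRollupBridge("root0")
bridge.submit_batch("root1", toy_proof("root0", "root1"), b"batch-1 tx data")
print(bridge.state_root)  # root1
```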

All of this explains why people are so enthusiastic about Rollup. Since full nodes no longer have to keep up with the sequencer, why not simply make the sequencer a very powerful machine? That lets it execute a huge number of transactions per second, driving gas fees down and keeping everyone happy. But the sequencer still has to publish its transaction data, which means that even if the sequencer were a genuine supercomputer, the number of transactions per second it can actually process is still bounded by the data throughput of the underlying data availability solution or data availability layer it uses.

In short, if the data availability solution or layer used by a Rollup cannot store the amount of data the Rollup sequencer wants to dump onto it, then the sequencer (and therefore the Rollup) cannot process more transactions even if it wants to, and gas fees on Ethereum will rise as a result.

This is why data availability is so important: if data availability is guaranteed, we can keep the Rollup sequencer’s behavior in check; and if a Rollup wants to maximize its transaction throughput, maximizing the data space throughput of the data availability solution or layer it relies on becomes just as critical.

But as you may have noticed, we haven’t fully answered the question of how to check the sequencer’s work. If full nodes on the Rollup’s main chain don’t need to keep up with the sequencer’s computation, the sequencer could withhold a large portion of the transaction data. The question then is: how can main chain nodes force the sequencer to dump its data onto the data availability layer? If nodes can’t do that, we have made no progress on scalability at all, because we would either have to trust the sequencer or pay for a supercomputer ourselves.

The above problem is also known as the “data availability problem”.

Solutions to “Data Availability Problems”

The most straightforward solution to the data availability problem is to force full nodes to download all of the data the sequencer dumps onto the data availability layer or solution. But we already know this doesn’t help, because it requires full nodes to keep up with the sequencer’s transaction throughput and raises the hardware requirements for running a full node, which ultimately hurts decentralization.

So it’s clear that we need a better solution to this problem, and luckily, we happen to have one.

Proof of data availability

Every time the sequencer dumps a new block of transaction data, nodes can “sample” that data using data availability proofs to convince themselves that the data really was published by the sequencer.

Although the way data availability proofs work involves a lot of math and technical jargon, I’ll try to explain it clearly (see John Adler).

First, we require that the block of transaction data dumped by the sequencer be erasure-coded. This means the original data is doubled in size, with the extra half encoded as redundant data (this is what we call an erasure code). Once the block is erasure-coded, any 50% of the coded data is enough to reconstruct the entire original block.
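
Here is a minimal Reed-Solomon-style sketch of that idea, assuming the chunks are just small integers and using Lagrange interpolation over a large prime field. Production systems use carefully chosen finite fields and two-dimensional encodings, but the “any 50% reconstructs everything” property is the same.

```python
# 2x erasure coding: k original chunks become 2k coded chunks,
# and any k of the coded chunks reconstruct the original data.
P = 2**61 - 1  # a large prime; real systems use specific finite fields

def _interpolate_at(points: list[tuple[int, int]], x: int) -> int:
    # Lagrange interpolation modulo P.
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, -1, P)) % P
    return total

def encode(chunks: list[int]) -> list[tuple[int, int]]:
    """Treat the k chunks as evaluations of a degree-(k-1) polynomial at x=0..k-1,
    then evaluate that polynomial at 2k points to get the extended data."""
    k = len(chunks)
    return [(x, _interpolate_at(list(enumerate(chunks)), x)) for x in range(2 * k)]

def decode(samples: list[tuple[int, int]], k: int) -> list[int]:
    """Recover the original k chunks from any k (x, y) samples."""
    pts = samples[:k]
    return [_interpolate_at(pts, x) for x in range(k)]

data = [10, 20, 30, 40]                               # k = 4 original chunks
coded = encode(data)                                  # 8 coded chunks
survivors = [coded[1], coded[3], coded[5], coded[7]]  # any 4 of the 8 suffice
print(decode(survivors, k=4))                         # [10, 20, 30, 40]
```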

Why is "data availability" critical to blockchain scaling?

Erasure coding relies on the same kind of technology that let you keep bullying your nasty cousin and his friends at Fortnite even after the disc got damaged that one time.

Note that once a block of transaction data has been erasure-coded, the sequencer must withhold more than 50% of the block in order to misbehave. Without erasure coding, the sequencer only needs to withhold as little as 1% of the data to misbehave. Erasure coding therefore makes it far easier for full nodes to hold the sequencer to its data availability obligations.

Nonetheless, we want the strongest possible assurance that the sequencer has published all of the data. Ideally, we would like the same level of confidence we would get from downloading the entire block of transaction data directly, and this is in fact achievable: a full node can randomly download a small piece of the block. If the sequencer is misbehaving, the full node has less than a 50% chance of being fooled, i.e. of happening to download a piece the sequencer did publish while the rest is being withheld. This is because a sequencer intent on misbehaving and withholding data must withhold more than 50% of the erasure-coded block.

This also means that if the full node repeats the operation, its chance of being fooled drops quickly. By randomly choosing a second piece of data to download, the full node cuts the probability of being fooled to below 25%. In fact, by the seventh random sample, the probability that a full node fails to detect a data-withholding sequencer is below 1%.
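
The arithmetic behind those numbers is simple enough to check directly; the snippet below just raises 1/2 to the number of samples, which is an upper bound on the chance that every sample happens to miss the withheld half.

```python
# If a cheating sequencer must withhold at least 50% of the erasure-coded block,
# each random sample catches it with probability >= 1/2, so the chance of being
# fooled shrinks geometrically with the number of samples.
for samples in [1, 2, 7, 20]:
    fooled = 0.5 ** samples  # upper bound on the probability that every sample misses
    print(f"{samples:>2} samples -> fooled with probability <= {fooled:.4%}")
# 1 sample -> <= 50%, 2 -> <= 25%, 7 -> <= 0.78%, 20 -> ~0.0001%
```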

This process of sampling with data availability proofs is referred to simply as data availability sampling. It is extremely efficient, because sampling lets nodes download only a small fraction of the data the sequencer published while achieving essentially the same guarantee as downloading and checking the whole block (nodes use the Merkle root committed on the main chain to locate and verify the pieces they sample). To get an intuitive feel for this, imagine burning as many calories as a 10-kilometer run by jogging around your neighborhood for just ten minutes; that is the kind of leverage data availability sampling gives you.
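
To show what a single sample roughly looks like, here is a toy Merkle commitment and proof check in Python. The chunk layout and helper names are invented for the example, but the idea, verifying a randomly chosen chunk against the root committed on the main chain, is the one described above.

```python
import hashlib, random

def h(b: bytes) -> bytes:
    return hashlib.sha256(b).digest()

def merkle_root(leaves: list[bytes]) -> bytes:
    level = [h(leaf) for leaf in leaves]
    while len(level) > 1:
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def merkle_proof(leaves: list[bytes], index: int) -> list[bytes]:
    proof, level, i = [], [h(leaf) for leaf in leaves], index
    while len(level) > 1:
        proof.append(level[i ^ 1])  # sibling hash at this level
        level = [h(level[j] + level[j + 1]) for j in range(0, len(level), 2)]
        i //= 2
    return proof

def verify(leaf: bytes, index: int, proof: list[bytes], root: bytes) -> bool:
    node = h(leaf)
    for sibling in proof:
        node = h(node + sibling) if index % 2 == 0 else h(sibling + node)
        index //= 2
    return node == root

chunks = [f"chunk-{i}".encode() for i in range(8)]  # erasure-coded block data
root = merkle_root(chunks)                          # this root lives on the main chain
i = random.randrange(len(chunks))                   # the node samples a random chunk
print(verify(chunks[i], i, merkle_proof(chunks, i), root))  # True if the data is served
```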

If main chain full nodes can perform data availability sampling, we can make sure the Rollup sequencer doesn’t misbehave. We should all be happy now: we can be confident that Rollup really will scale our favorite blockchain. But before you close this page, remember that we still need a way to scale data availability itself. If we want everyone in the world to join the blockchain (and make more money), we need Rollup; and if we want Rollup to scale the blockchain, we not only need to keep the sequencer from misbehaving, we also need to scale data space throughput so that dumping transaction data stays cheap for the sequencer.

Proof of data availability is also key to scaling data space throughput

One layer 1 whose recent roadmap focuses on scaling data space throughput is Ethereum. It plans to do this through data sharding, which means that not every validator will keep downloading the same transaction data the way nodes do today (validators also run nodes). Instead, Ethereum will essentially split its validator network into different partitions, an operation known as “sharding”. Say you have 1,000 validators all storing the same data: if you split them into 4 groups of 250, you have quadrupled the data space available for Rollups to dump into, in an instant. Simple enough, right?
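
The arithmetic really is that simple; the snippet below just multiplies it out with an assumed per-shard data budget (the 2 MB figure is illustrative, not Ethereum’s actual parameter).

```python
# Back-of-the-envelope sketch of the sharding arithmetic above:
# splitting validators into independent shards multiplies total data space.
validators = 1000
data_per_shard_mb = 2  # assumed per-shard budget, for illustration only

for shards in [1, 4, 64]:
    validators_per_shard = validators // shards
    total_data_mb = shards * data_per_shard_mb
    print(f"{shards:>2} shards: {validators_per_shard:>4} validators each, "
          f"{total_data_mb} MB of data space")
```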

Why is "data availability" critical to blockchain scaling?

Ethereum is trying to set up 64 data shards in its “near-term” data sharding roadmap

The problem, however, is that validators within a shard only download the transaction data that was dumped onto their own shard. This means they cannot guarantee that all of the data dumped by the sequencer is available; they can only vouch for the data on their shard, not for the data on the others.

So we could end up in a situation where the validators in one shard cannot tell whether the sequencer is misbehaving, because they don’t know what is happening on the other shards. This problem, too, can be solved with data availability sampling: a validator in one shard can sample data, using data availability proofs, from every other shard. That makes it effectively a validator of every shard, data availability is guaranteed, and Ethereum can shard its data safely.
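
A toy simulation of that idea, with entirely assumed parameters, might look like the following: a validator takes a handful of samples from each of the other shards, and a shard withholding more than half of its coded data is caught almost surely.

```python
import random

def shard_withholding_detected(withheld_fraction: float, samples: int) -> bool:
    # Each random sample hits a withheld chunk with probability withheld_fraction.
    return any(random.random() < withheld_fraction for _ in range(samples))

random.seed(0)
shards = 64
samples_per_shard = 10
bad_shard = 13  # suppose one shard withholds the >50% needed to cheat

detections = sum(
    shard_withholding_detected(0.5 if s == bad_shard else 0.0, samples_per_shard)
    for s in range(shards)
)
print(f"withholding detected in {detections} shard(s)")  # almost always 1
```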

Some other blockchains, namely Celestia and Polygon Avail, also want to massively scale their data space throughput. Unlike most blockchains, Celestia and Polygon Avail do only two things: order transactions and serve as a data availability layer. This means that to keep Celestia and Polygon Avail validators honest, we badly need a decentralized network of nodes making sure those validators really are storing and ordering the transaction data. But since this data requires no processing (no execution or computation), full nodes are not needed to keep the validators honest. Instead, light nodes that perform data availability sampling can do the job of full nodes, and a large enough number of such sampling light nodes is sufficient to hold validators accountable for data availability. In other words, as long as enough nodes sample data availability using data availability proofs (which is fairly easy, given that such sampling can run even on a mobile phone), block sizes, and with them validator hardware requirements, can safely grow, thereby increasing data space throughput.
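
A rough way to see why the number of sampling light nodes bounds the safe block size: collectively, the light nodes’ samples need to cover enough coded chunks (at least 50% in this sketch) that the block could be reconstructed from them. The parameters below are assumptions for illustration only.

```python
# Expected fraction of coded chunks downloaded by at least one light node,
# when each light node samples a fixed number of random chunks.
def expected_coverage(chunks: int, light_nodes: int, samples_each: int) -> float:
    # Probability that any given chunk is downloaded by at least one light node.
    return 1 - (1 - samples_each / chunks) ** light_nodes

for chunks in [1_000, 10_000, 100_000]:      # bigger block = more coded chunks
    for nodes in [100, 1_000, 10_000]:
        cov = expected_coverage(chunks, nodes, samples_each=30)
        print(f"{chunks:>7} chunks, {nodes:>6} light nodes -> {cov:.1%} covered")
```

As the block (number of chunks) grows, more light nodes are needed to keep collective coverage high, which is why these designs tie safe block size to the size of the sampling light node network.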

Why is "data availability" critical to blockchain scaling?

Now let’s recap: the data availability problem is perhaps the crux of the blockchain trilemma, and it affects all of our scaling efforts. Fortunately, we can solve it with the core technology of data availability proofs. That lets us massively scale data space throughput, reducing the cost for Rollups to dump large amounts of transaction data so they can process enough transactions for the whole world to take part. Data availability proofs also let us keep the Rollup sequencer honest without having to trust it. Hopefully this article has helped you understand why data availability is so important to realizing the full potential of Rollup.
