Paradigm: Challenges and Future R&D Directions of Data Availability Sampling (DAS) for Public Blockchains

Note: The original author, Joachim Neu, is a research intern at Paradigm and a PhD student at Stanford University researching blockchains. His current research focuses on provable consensus security.

Introduction

The core responsibility of any L1 blockchain is to ensure data availability. This guarantee is critical both for clients to be able to interpret the L1 blockchain itself and as a foundation for higher-level applications such as rollups. To this end, an oft-discussed technique is random sampling for data availability verification, popularized in a 2018 paper by Mustafa Al-Bassam, Alberto Sonnino, and Vitalik Buterin. The technique is at the heart of the Celestia blockchain and has been proposed for inclusion in proof-of-stake (PoS) Ethereum via “Danksharding”.

The purpose of this blog post is to explain the fundamentals of Data Availability Sampling (DAS), the model it relies on, and the challenges and open problems encountered when implementing the technique in practice. We hope this article draws researchers’ attention to the problem and inspires new ideas for addressing some of the outstanding challenges (see the Ethereum Foundation’s recent Request for Proposals).

The Problem

Someone (such as an L1 block proposer or an L2 sequencer) produces a block of data. They claim to have made the data available to the “public”. Your goal is to check that availability claim: could you actually obtain the data if you needed it?

Data availability is critical. Optimistic systems based on fraud proofs, such as Optimism, require data availability for verification; even systems based on validity proofs, such as StarkNet or Aztec, require data availability to ensure liveness (e.g., to prove asset ownership for a rollup’s escape hatch or forced-transaction-inclusion mechanism).

For the problem as formulated so far, there is a simple “naive” testing procedure, implicitly adopted by early systems such as Bitcoin: simply download the entire block of data. If you succeed, you know it is available; if you do not, you deem it unavailable. However, we now want to test availability without downloading too much data ourselves, for example because the amount of data exceeds our processing capacity, or because it seems wasteful to spend a lot of bandwidth on data whose availability we want to verify but which we are not otherwise interested in. At this point, we need a model that clarifies what it “means” to download or retain only “part of the data”.

Model

A common approach in computer science is to first describe a new technique in a model with convenient idealized components, and then explain how that model can be realized. We take a similar approach to DAS, and as we will see, interesting open R&D questions pop up when we try to instantiate the model.

In our model, there is a bulletin board in a dark room (see the illustration below). First, the block producer enters the room and has the opportunity to write some information on the bulletin board. When the block producer exits, they can hand you (the validator) a small piece of information (small in the sense that its size does not scale linearly with the original data). You enter the room with a flashlight that has a narrow beam and a weak battery, so you can only read the text at a few distinct spots on the bulletin board. Your goal is to convince yourself that the block producer actually left enough information on the bulletin board that, if you turned on the lights and read the full board, you could recover the file.

[Illustration: a validator with a flashlight inspecting a bulletin board in a dark room]

At first, the problem seems tricky: we could ask the block producer to write the full file on the bulletin board. Now consider two possibilities: either the block producer honestly wrote the entire file, or the block producer misbehaved and left out a small but crucial piece, rendering the whole file useless. By inspecting the bulletin board in only a few locations, you cannot reliably distinguish the two cases, and therefore you cannot reliably check data availability. We need a new approach!

(Theoretical) Solutions

This is where Reed-Solomon erasure codes come into play. Let us briefly recap. In simple terms, an erasure code works as follows: a vector of k information blocks is encoded into a (longer!) vector of n coded blocks. The code rate R = k/n measures the redundancy introduced by the encoding. From certain subsets of the coded blocks, we can then decode the original information blocks.

[Figure: erasure coding maps k information blocks to n coded blocks]

If the code is Maximum Distance Separable (MDS), then the original information blocks can be recovered from any subset of k coded blocks (i.e., any subset whose size equals the number of information blocks), which is a useful efficiency and robustness guarantee. Reed-Solomon codes are a popular family of MDS codes, and they work as follows. Recall from school that two points uniquely determine a line:

[Figure: two points uniquely determine a line]

This is because a line can be described as a polynomial of degree 1 with two coefficients: y = a_1·x + a_0 (we assume throughout that points have distinct x-coordinates). In fact, this idea generalizes: any polynomial of degree t-1,

p(x) = a_{t-1}·x^{t-1} + … + a_1·x + a_0,

with its vector of coefficients (a_0, …, a_{t-1}), is uniquely determined by any t points that the polynomial passes through. In other words: once you know the evaluations of the polynomial at t different places, you can obtain its evaluation at any other place (first recover the polynomial, then evaluate it).

Reed-Solomon codes are built on this insight. For encoding, we start with the k information blocks (m_0, …, m_{k-1}), construct the associated polynomial

p(x) = m_{k-1}·x^{k-1} + … + m_1·x + m_0,

and evaluate it at n distinct x-coordinates to obtain the n coded blocks. By the insight above, any k of these coded blocks uniquely determine the polynomial of degree k-1, and reading off its coefficients recovers the original information blocks. Voila!

[Figure: Reed-Solomon encoding: k coefficients in, n evaluations out]
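
To make this concrete, here is a minimal, self-contained Python sketch of Reed-Solomon encoding and decoding (all parameters are illustrative: a toy prime field, x-coordinates 1 through n, and O(k^2) Lagrange interpolation; production systems use large fields and FFT-based algorithms):

```python
P = 2**13 - 1  # small prime field modulus, for illustration only

def rs_encode(info_blocks, n):
    """Interpret k info blocks as polynomial coefficients; return n evaluations."""
    k = len(info_blocks)
    assert n >= k
    def eval_poly(x):
        # Horner's rule: p(x) = m_{k-1} x^{k-1} + ... + m_1 x + m_0
        acc = 0
        for coeff in reversed(info_blocks):
            acc = (acc * x + coeff) % P
        return acc
    return [eval_poly(x) for x in range(1, n + 1)]  # x-coordinates 1..n

def rs_decode(samples, k):
    """Recover the k coefficients from any k (x, y) pairs via Lagrange interpolation."""
    assert len(samples) == k
    coeffs = [0] * k
    for i, (xi, yi) in enumerate(samples):
        # Build the Lagrange basis polynomial L_i(x) = prod_{j != i} (x - x_j) / (x_i - x_j)
        basis = [1]  # coefficient list of the running product, lowest degree first
        denom = 1
        for j, (xj, _) in enumerate(samples):
            if j == i:
                continue
            # multiply the running product by (x - xj)
            basis = [(-xj * basis[0]) % P] + [
                (basis[d - 1] - xj * basis[d]) % P for d in range(1, len(basis))
            ] + [basis[-1]]
            denom = (denom * (xi - xj)) % P
        scale = yi * pow(denom, -1, P) % P
        for d in range(k):
            coeffs[d] = (coeffs[d] + scale * basis[d]) % P
    return coeffs

# Example: k = 4 info blocks, rate R = 1/2, so n = 8 coded blocks.
info = [7, 1, 2, 9]
coded = rs_encode(info, 8)
# Any 4 of the 8 coded blocks suffice to recover the info blocks:
subset = [(3, coded[2]), (5, coded[4]), (6, coded[5]), (8, coded[7])]
assert rs_decode(subset, 4) == info
```

Any k of the n coded blocks suffice, which is exactly the MDS property the availability check below relies on.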

Back to our data availability problem: instead of asking the block producer to write the original file on the bulletin board, we ask them to divide the file into k blocks, encode those using a Reed-Solomon code of rate, say, R = 1/2, and write the n = 2k coded blocks on the bulletin board. For now, assume the block producer at least follows the encoding honestly (we will see how to lift this assumption later). Consider again the two cases: the producer behaves honestly and writes all blocks, or the producer misbehaves and wants to keep the file unavailable. Recall that we can recover the original file from any k of the n = 2k coded blocks. So to keep the file unavailable, the block producer can write at most k-1 blocks. In other words, at least k+1 of the n = 2k coded blocks, more than half, must now be missing!

But now the two cases, a full bulletin board and a half-empty one, are easy to distinguish: you check the bulletin board at a small number r of randomly sampled locations. If every sampled location holds its block, you consider the file available; if any sampled location is empty, you consider the file unavailable. Note that if the file is unavailable, so that (more than) half of the bulletin board is empty, the probability that you erroneously deem the file available is at most

(1/2)^r,

i.e., exponentially small in r.
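
A quick toy simulation of this check (the board size, withholding pattern, and r are all illustrative) confirms that a half-empty board is essentially never mistaken for a full one:

```python
import random

def looks_available(board, r):
    """Sample r random positions; accept only if none of them is empty (None)."""
    return all(board[i] is not None for i in random.sample(range(len(board)), r))

k, r, trials = 128, 30, 100_000
n = 2 * k
# Adversarial board: only k-1 of the n = 2k coded blocks are written, so the
# file is NOT recoverable, and more than half of the board is empty.
board = [b if b < k - 1 else None for b in range(n)]

fooled = sum(looks_available(board, r) for _ in range(trials))
print(f"fooled in {fooled}/{trials} trials; bound (1/2)^r = {0.5**r:.2e}")
```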

(Practical) Challenges

The “bulletin board in a dark room” model given above is very simple. Now let us scrutinize it: what do its components represent? Can we implement them in a real computer system? If so, how?

In fact, to help uncover the gap between theory and practice, we deliberately explained the problem and solution using the “weird” “bulletin board in a dark room” model, whose metaphors bear little resemblance to real computing systems. This is to encourage you to think about how aspects of the model world correspond to the real world, and where they can (or cannot) be realized. If some part of the model does not translate into a computer/network/protocol equivalent, then you know there is still work to do: perhaps something is wrong with your understanding, or perhaps you have found an open research question! ;)

What follows is a non-exhaustive collection of challenges. For some of them, the community has found reasonable answers over the years; others remain open research questions.

Challenge A: How do you know that a block on the bulletin board was actually written by the block producer? Equivalently, how do you prevent sampled blocks from being modified as they travel to the sampling node over the network? This is the purpose of the small piece of information that the block producer hands to the sampling node as the producer leaves and the sampler enters the dark room. In practice, it is realized as a binding vector commitment (think: Merkle tree) to the original content written on the bulletin board, and it is shared as part of the block header. Given the commitment, the block producer can leave, next to each coded block on the bulletin board, a proof that the block was indeed written by the producer. Third parties cannot modify blocks in transit, because the commitment scheme does not allow forging a valid proof for a modified block. Note that this by itself does not preclude the block producer from writing invalid/inconsistent blocks on the bulletin board; we discuss that next.
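
A minimal sketch of such a Merkle-tree vector commitment with per-block membership proofs (illustrative only: the number of blocks is assumed to be a power of two, and real implementations add padding rules and domain separation):

```python
import hashlib

def H(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(blocks):
    level = [H(b) for b in blocks]
    while len(level) > 1:
        level = [H(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def merkle_proof(blocks, index):
    """Sibling hashes on the path from block `index` up to the root."""
    level = [H(b) for b in blocks]
    proof = []
    while len(level) > 1:
        proof.append(level[index ^ 1])  # sibling at this level
        level = [H(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        index //= 2
    return proof

def merkle_verify(root, block, index, proof):
    node = H(block)
    for sibling in proof:
        node = H(node + sibling) if index % 2 == 0 else H(sibling + node)
        index //= 2
    return node == root

# The root travels in the block header; each coded block on the "bulletin
# board" is accompanied by its proof, so nobody can tamper with it in transit.
blocks = [bytes([i]) * 32 for i in range(8)]
root = merkle_root(blocks)
assert merkle_verify(root, blocks[5], 5, merkle_proof(blocks, 5))
```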

Challenge B: How do you ensure that the block producer erasure-coded the content correctly? In the scheme above we assumed the block producer encodes the information blocks correctly, so that the erasure code’s guarantee holds, i.e., from sufficiently many coded blocks the information blocks can actually be recovered. In other words, all a block producer can do is withhold blocks, not confuse us with invalid blocks. In practice, there are three common ways to rule out invalid encodings:

Fraud proofs. This approach relies on some sampling nodes being powerful enough to sample so many blocks that they can spot inconsistencies in a block’s encoding and issue an invalid-encoding fraud proof, causing the file in question to be marked unavailable. Work in this area aims to minimize the number of blocks a node must inspect (and forward as part of a fraud proof) to detect fraud (see the original Al-Bassam/Sonnino/Buterin paper, which uses 2D Reed-Solomon codes for this purpose).

Polynomial commitments. This approach uses a KZG polynomial commitment as the binding vector commitment included in the block header to solve Challenge A. Polynomial commitments allow Reed-Solomon coded blocks to be verified directly against a commitment to the unencoded information blocks, so there is no room for invalid encodings. Think of it this way: with polynomial commitments, the vector commitment and the Reed-Solomon encoding are inseparable. (The verification identity is sketched after this list.)

Validity proofs. A cryptographic proof system can be used to prove the correct erasure coding of the coded blocks under the vector commitment. This approach is a good pedagogical “mental model” and is generic with respect to the erasure code used, but it is unlikely to be computationally efficient for quite some time.
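
For orientation, the standard KZG mechanics behind the polynomial-commitment approach are sketched below (informally; [·]_1 and [·]_2 denote scalar multiples of the generators of the two pairing groups, τ is the secret point from the trusted setup, and e is the pairing):

```latex
C   = [p(\tau)]_1                      % commitment to the polynomial p
\pi = [(p(\tau) - y) / (\tau - z)]_1   % proof that p(z) = y
% the verifier checks a single pairing equation:
e(C - [y]_1, [1]_2) = e(\pi, [\tau - z]_2)
```

Since every Reed-Solomon coded block is an evaluation (z, y) of the very polynomial that C commits to, verifying a coded block against the header commitment amounts to verifying an evaluation, and an invalidly encoded block simply admits no valid proof.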

Challenge C: What is the bulletin board, and where is it? How does the proposer “write” to it? Before we discuss the “what” and “where” of the bulletin board, how proposers “write” to it, and how validators “read”/“sample” from it, let us review the well-known drawbacks of the two basic P2P networking primitives:

1. Low-level flood-based publish/subscribe gossip networks, such as GossipSub, where communication is organized into different “broadcast groups” (“topics”) that participants can join (“subscribe”) and send messages to (“publish”):

  • Insecure under arbitrary (“Byzantine”) adversarial behavior (e.g., eclipse attacks, Sybil attacks, attacks on peer discovery);
  • Common variants do not even offer Sybil resistance;
  • There is usually no way to hide a participant’s group memberships from other participants (in fact, group membership is typically communicated to peers precisely so that they avoid forwarding unwanted topic traffic);
  • If there are a large number of topics with few subscribers each, communication tends to become unreliable (the subgraph of nodes subscribed to a particular topic may no longer be connected, so flooding can fail).

2. Distributed hash tables (DHTs), such as Kademlia, where each participant stores a portion of the totality of the data stored in the hash table, and participants can quickly determine short paths to the peers storing a specific piece of information:

  • Also not Byzantine fault tolerant (e.g., adversaries can route requests improperly or attack the network’s formation/maintenance);
  • In fact, DHTs are considerably less resilient to adversarial behavior than gossip protocols: a gossip protocol “only” requires that the subgraph formed by honest nodes (and the edges between them) be connected, so that information can reach all honest nodes from any honest node, whereas in a DHT a piece of information is routed along a single path, and a query can fail as soon as it hits an adversarial node on that path;
  • Also provides no Sybil resistance mechanism;
  • No privacy guarantees as to which participants store or request which information (against the curious eyes of other participants).

With this in mind, we can return to our central questions of how to implement the bulletin board and its read/write operations. Where are the coded blocks stored? How do they get there? The three main approaches the community is considering are:

  1. GOSSIP: Disperse the coded blocks using a gossip network. For example, there might be one topic per coded block, and the nodes responsible for storing a block subscribe to the corresponding topic.
  2. DHT: Upload the coded blocks into a DHT. The DHT then “automatically” determines which blocks each participant should store (a toy sketch of such an assignment follows this list).
  3. REPLICATE: Sample from nearby replicas. Some nodes store full (or partial) copies of the data and serve block requests from sampling nodes.
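
As a toy illustration of the DHT approach, here is a minimal sketch of how a Kademlia-style DHT deterministically maps blocks to the nodes “closest” to them in ID space (all names and parameters here are hypothetical; real DHTs add replication factors, routing tables, and churn handling):

```python
import hashlib

def dht_id(name: str) -> int:
    """Map a node name or block key into the DHT's ID space."""
    return int.from_bytes(hashlib.sha256(name.encode()).digest(), "big")

def closest_nodes(block_key: str, nodes, count=3):
    """Kademlia-style: the `count` nodes whose IDs are XOR-closest to the key."""
    key = dht_id(block_key)
    return sorted(nodes, key=lambda n: dht_id(n) ^ key)[:count]

nodes = [f"peer-{i}" for i in range(20)]
for i in range(4):  # which peers are responsible for each coded block?
    print(f"block {i} ->", closest_nodes(f"block/{i}", nodes))
```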

The challenges of these approaches are:

  1. How do we ensure there is “enough space on the bulletin board” to begin with (i.e., enough participants subscribed to each topic under GOSSIP, or enough capacity at each node for the blocks it must store under DHT), and that all parts of the bulletin board remain online over time? (Ideally, for scalability, we would even want to use storage efficiently, i.e., there should not be much redundancy between what honest nodes store.) This is particularly tricky in a truly permissionless system, where nodes come and go and there may be no Sybil resistance mechanism, so most “nodes” may be adversarial and may vanish in an instant. Fortunately, in a blockchain environment some Sybil resistance mechanism (such as PoS) usually exists and can be used to build reputation and accountability, but many details of how Sybil resistance can be leveraged to protect the peer-to-peer network layer remain to be worked out.
  2. Expanding on the previous point: since the network is the foundation of consensus, and hence of the so-called Byzantine fault tolerant (BFT) system, the network layer itself should preferably be BFT, but as mentioned, popular gossip and DHT protocols (such as GossipSub and Kademlia) are not. (Even REPLICATE can face this challenge, since a DHT may still be used in other parts of the networking stack, e.g., for peer discovery; at that point, however, the DHT challenge becomes a generic network-layer problem rather than one specific to data availability sampling.)
  3. Finally, some argue that in the long run no node should have to store or forward more than a small fraction of a block, since otherwise scalability and the ability to support relatively “weak” participants (cf. decentralization) are limited. This rules out REPLICATE. For GOSSIP, it requires a large number of broadcast groups (“topics”) with few subscribers each, which is exactly the regime in which gossip protocols tend to become unreliable. In any case, the above approaches come with overhead, e.g., the bandwidth spent forwarding blocks on behalf of other nodes, which must not exceed an individual node’s budget.

Challenge D: “How” do we implement the random sampling? There are two aspects to this problem: how a desired block is located and transported over the network (i.e., how it is “read” from the bulletin board), and how to ensure the sampling “remains random” with respect to the adversary, i.e., that an adversarial block producer gets no (or not too much) opportunity to adaptively change its strategy based on who queries which blocks.

  • Of course, sampling directly from the block producer is not viable: it would require high bandwidth at the block producer, and if everyone knew the block producer’s network address, there would be an obvious denial-of-service vector. (Some hybrid constructions that involve pulling from the block producer can be viewed through the lens of DHT and REPLICATE.)
  • Another way is to sample from the “swarm” after dispersing the blocks using one of the methods above (GOSSIP or DHT). Specifically:

(1) A DHT can conveniently route sampling requests for randomly sampled blocks after the blocks have been dispersed using GOSSIP or DHT, but this runs into the challenges discussed above, most notably the lack of BFT and privacy.

(2) Alternatively, under GOSSIP, each node can subscribe to the topics corresponding to the blocks it wants to sample, but besides the lack of BFT and privacy mentioned above, a large number of topics with very few subscribers each tends to make communication unreliable.

  • REPLICATE offers a compromise between “sample from the block producer” and “sample from the swarm”: blocks are drawn from full replicas of the data, which are located among one’s network peers.

Note that the above only addresses sampling (“reading” from the bulletin board) now, not “reading” from the bulletin board at an arbitrary point in the future. Specifically, GOSSIP essentially implements an ephemeral bulletin board (which can only be read/sampled while its content is being written/dispersed), whereas a DHT implements a permanent bulletin board (which can also be read/sampled long afterwards). Often what we need is a permanent bulletin board (the permanence requirement varies from “days” to “forever” depending on the design), and for this, GOSSIP must be supplemented with a DHT to route to blocks, which brings back the aforementioned challenges. REPLICATE, in contrast, immediately implements a permanent bulletin board.

The following table illustrates how different P2P protocols can be used to implement different parts of the model. There are two variants of the gossip-oriented approach: one uses gossip to sample blocks, the other uses a DHT to sample blocks. The DHT-oriented approach relies entirely on the DHT for all relevant operations. In the replication-oriented approach, each node reads/samples blocks from a nearby full replica using a request/response protocol; it effectively also uses gossip for the initial dispersal of blocks, although gossiping between two peers can technically be realized via a request/response protocol as well.

[Table: which P2P primitives (gossip, DHT, request/response) implement block dispersal and sampling under the gossip-oriented (two variants), DHT-oriented, and replication-oriented approaches]

Furthermore, in all of the above techniques, “who samples what” leaks (at least partially) to the adversary. An adversary can thus adaptively suppress or promote the propagation of the specific blocks sampled by specific nodes, and thereby deceive some nodes into believing that a block is available (or unavailable). While early works show that only a few nodes can be deceived this way, this is still undesirable. Alternatively, earlier works assume anonymous network communication, which in practice comes with at least a considerable performance penalty, if it is practical at all.

Challenge E: How do we “repair” the content of the bulletin board? That is, if a coded block is lost (e.g., because the nodes storing it went offline; and how is that even detected?), how is it restored? Naive repair involves decoding and re-encoding, and thus imposes a considerable communication and computation burden, especially for the commonly used Reed-Solomon codes. Who bears this burden, and how are they compensated? How do we prevent a malicious block producer from harming sampling nodes by withholding a few coded blocks and forcing the network to spend resources on expensive repairs? What about distributed repair schemes? And how are the blocks needed for repair retrieved in the first place, which circles back to the earlier point about “reading” from the bulletin board in the future.
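
To see why naive repair is burdensome, here is a minimal sketch (reusing rs_encode from the Reed-Solomon sketch earlier; parameters illustrative): even restoring a single lost coded block requires fetching k intact blocks and doing a full interpolation.

```python
def repair(surviving, missing_x, P=2**13 - 1):
    """Restore p(missing_x) via Lagrange evaluation from k intact (x, y) pairs.
    Note the cost: k blocks of communication and O(k^2) field operations,
    all to regenerate just one block."""
    total = 0
    for i, (xi, yi) in enumerate(surviving):
        num, den = 1, 1
        for j, (xj, _) in enumerate(surviving):
            if j != i:
                num = num * (missing_x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, -1, P)) % P
    return total

# Example: the block at x = 2 is lost; any k = 4 surviving blocks restore it.
info = [7, 1, 2, 9]
coded = rs_encode(info, 8)  # from the Reed-Solomon sketch above
survivors = [(x, coded[x - 1]) for x in (1, 4, 6, 7)]
assert repair(survivors, 2) == coded[1]
```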

Challenge F: Incentives. If sampling is free, how do we prevent it from becoming a denial-of-service vector? If sampling is paid for (and how would that be implemented?), how can it nonetheless remain anonymous? And how are those who store (parts of) the bulletin board, route information, or perform maintenance tasks such as block repair in the peer-to-peer network compensated?

Another Model

For completeness, we briefly mention a slightly different model in which DAS provides a slightly different guarantee: even without anonymous and reliable network communication, an adversary can fool at most a certain number of honest nodes into believing an unavailable file is available, because to fool more of them it would have to release so many blocks that the file could in fact be recovered from the union of all blocks obtained by honest nodes. The advantage of this model is that the required network properties are easier to achieve. The downside is that it gives no guarantee to any individual user (you could be one of the few being fooled!), and that it is unclear how to actually collect the samples obtained by all honest nodes and recover the file (especially when the P2P network is partially controlled by the adversary).
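
A toy simulation of this guarantee (all parameters illustrative): the adversary serves only k-1 of the n = 2k blocks, the most it can release while keeping the file unrecoverable, and we count how many of N honest nodes have all r of their samples answered:

```python
import random

k, r, N = 128, 15, 10_000
n = 2 * k
released = set(random.sample(range(n), k - 1))  # max the adversary can release

fooled = sum(
    all(s in released for s in random.sample(range(n), r))
    for _ in range(N)
)
print(f"{fooled} of {N} honest nodes fooled "
      f"(expected about N * (1/2)^r = {N * 0.5**r:.2f})")
```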

Future Research and Development Directions

Based on the observations and arguments presented in this blog post, we believe the following are interesting directions for future research and development on data availability:

  • Clearly, some Sybil resistance mechanism is necessary to protect the network layer (currently, network protocols often rely implicitly on the scarcity of IP addresses, e.g., see GossipSub v1.1’s peer scoring). Conveniently, the consensus layer provides exactly that, e.g., in the form of proof-of-stake (PoS). It therefore seems natural to reuse the consensus layer’s Sybil resistance mechanism at the network layer, for instance by sampling a node’s gossip peers from the validator set (thereby “inheriting” the strength of consensus’s honest-majority assumption; a minimal sketch follows this list). While this may not immediately protect nodes that are not active consensus participants, it can help create a secure “backbone” among consensus nodes (thus strengthening consensus security), and may subsequently serve as a stepping stone towards better security for everyone. A logical next step on this path would be a careful analysis of the interaction of consensus and network layers under such a shared Sybil resistance mechanism (a recent work takes a first step in this direction).
  • Improved gossip and DHT protocols (see this survey), with:

(1) Byzantine fault tolerance (BFT), in particular by reusing the consensus layer’s Sybil resistance mechanism

(2) Efficiency (especially for BFT variants, which so far come with considerable overhead and/or weaker resilience)

(3) Privacy guarantees (stronger guarantees, better efficiency/lower overhead)

  • Repair Mechanism:

(1) Implementing repair in a distributed fashion (erasure codes with locality?)

(2) Researching and designing the associated incentives
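
As a minimal sketch of the first direction above (all names and parameters hypothetical): sampling a node’s gossip peers from the validator set, weighted by stake, might look like this.

```python
import random

# Hypothetical validator set with stakes (illustrative values).
validators = {f"validator-{i}": stake
              for i, stake in enumerate([32, 32, 64, 96, 32, 320])}

def sample_peers(validators, degree, rng=random):
    """Pick gossip peers stake-weighted from the validator set (deduplicated)."""
    names, stakes = zip(*validators.items())
    peers = set()
    while len(peers) < min(degree, len(names)):
        peers.add(rng.choices(names, weights=stakes, k=1)[0])
    return peers

print(sample_peers(validators, degree=3))
```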

Acknowledgements: Special thanks to Mustafa Al-Bassam, Danny Ryan, Dankrad Feist, Sreeram Kannan, Srivatsan Sridhar, Lei Yang, Dan Robinson, Georgios Konstantopoulos, and Dan Lee for fruitful discussions and feedback on early drafts of this article, and to Achal Srinivasan for the beautiful illustrations.
