a16z: Measuring SNARK performance with three factors

This article is from a16z. Original author: Justin Thaler; translated by Odaily Planet Daily translator Katie Koo.


SNARKs (Succinct Non-Interactive Arguments of Knowledge) are an important cryptographic primitive finding applications in blockchain scalability (e.g., L2 rollups) and privacy. SNARKs let someone prove to an untrusting verifier V (such as the Ethereum blockchain) that they know some data (such as a batch of valid transactions). A naive way to prove this is to send the data to V, which can then check its validity directly. A SNARK achieves the same effect, but at a much lower cost to V. In particular, a SNARK proof should be shorter than the naive proof consisting of the data itself.

Verification costs for SNARKs can vary, but they are well understood and often quite good. For example, PlonK proofs cost about 290,000 gas to verify on Ethereum, while StarkWare's proofs cost about 5 million gas. SNARKs are also potentially useful in a variety of settings outside blockchains, for example enabling the use of fast but untrusted servers and hardware.

But because SNARK verification is typically cheap, the main determinant of applicability is the cost of the SNARK prover P. This article explains how to estimate these costs, to determine when it is reasonable to use SNARKs and how SNARKs might improve in the future. It is worth noting that this is a rapidly evolving field, and several of the projects discussed here are actively improving their performance.

How are SNARKs deployed?

In a SNARK deployment, developers typically write a computer program ψ that takes as input the data w that the prover claims to know (w stands for witness) and checks whether w is valid. For example, in a rollup the program would check that all transactions in w are digitally signed, that none of them causes an account balance to fall below zero, and so forth. The program ψ is then fed through a SNARK front end, which compiles it into a format more amenable to SNARK techniques. This SNARK-friendly format is called an intermediate representation (IR).

Typically, the IR is some kind of circuit-satisfiability instance equivalent to ψ. This means that the circuit C takes as input the data w, along with some extra inputs usually referred to as non-deterministic advice, and runs ψ on w. The advice inputs help C run ψ while keeping C small. For example, every time ψ divides two numbers x and y, the quotient q and remainder r can be provided to C as advice, and C simply checks that x = q·y + r. This check is cheaper than making C run a division algorithm to compute q and r from scratch.
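Here is a minimal sketch in Python of this advice pattern (illustrative only, not tied to any particular SNARK toolchain): the prover computes q and r outside the circuit, and the circuit only checks the cheap relation x = q·y + r.

def prover_advice(x, y):
    # The prover computes the quotient and remainder outside the circuit
    # and supplies them as advice inputs.
    return x // y, x % y

def circuit_check(x, y, q, r):
    # With advice, the circuit only verifies one multiplication, one addition,
    # and a range check -- far fewer gates than running a division algorithm.
    return x == q * y + r and 0 <= r < y

x, y = 1_000_003, 17
q, r = prover_advice(x, y)
assert circuit_check(x, y, q, r)           # a valid witness passes
assert not circuit_check(x, y, q + 1, r)   # bad advice is rejected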

Finally, a SNARK for circuit satisfiability is applied to C. This is called the SNARK back end. For a handful of highly structured problems, such as matrix multiplication, convolutions, and certain graph problems, known SNARKs avoid this front-end/back-end paradigm and achieve much faster proving. But the focus of this article is on general-purpose SNARKs.

As we will see, the prover cost of the SNARK back end grows with the size of C. Keeping C small can be a challenge, because circuits are an extremely limited format for expressing computation. They consist of gates connected by wires. Each gate g is fed some values and applies a very simple function to them; the result is then fed via wires out of g into "downstream" gates.

How expensive is the SNARK prover?

The key question is: how much longer does the SNARK prover take, relative to simply re-executing ψ on the data? The answer is the prover overhead of the SNARK, relative to direct witness checking. The latter phrase refers to the fact that, in the naive proof in which P sends w to V, V checks the validity of w by executing ψ on it.

It is helpful to split the prover overhead into "front-end overhead" and "back-end overhead." If evaluating the circuit C gate by gate costs F times as much as running ψ, we say the front-end overhead is F. If applying the back-end prover to C costs B times as much as evaluating C gate by gate, we say the back-end overhead is B. The total prover overhead is the product F·B. This multiplicative overhead can be expensive even when F and B are individually modest.

In practice, F and B can each be roughly 1,000 or more. This means the total prover overhead relative to direct witness checking can be about 1 million to 10 million or more. A program that runs for one second on a laptop can easily lead to a SNARK prover requiring tens or hundreds of days of compute time (single-threaded). Fortunately, this work is typically parallelizable to varying degrees (depending on the SNARK).
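To see how quickly these factors multiply, here is a back-of-the-envelope calculation using the round numbers above (an illustration, not a measurement of any particular SNARK):

F = 1_000            # front-end overhead: gates in C per primitive operation of psi
B = 1_000            # back-end overhead: prover work per gate of C
direct_seconds = 1   # direct witness checking: one second on a laptop

total_overhead = F * B
prover_seconds = direct_seconds * total_overhead
print(f"total prover overhead: {total_overhead:,}x")
print(f"single-threaded prover time: {prover_seconds / 86_400:.1f} days")  # ~11.6 days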

In summary, if you want to use SNARKs in an application today, one of the following three conditions should hold:

1. Direct witness checking takes much less than a second on a laptop.

2. Direct witness checking is especially amenable to representation in a circuit, so the front-end overhead is small.

3. You are willing to wait days for the SNARK prover to finish, and/or pay for huge amounts of parallel compute resources.

The rest of this article explains where front-end and back-end overheads come from, how I estimate them for different SNARKs, and the prospects for future improvements.

Separating front ends and back ends

Fully separating the front end from the back end can be tricky, because different back ends support different kinds of circuits. Hence, front ends can differ depending on the back end they expect to interface with.

SNARK back ends typically support so-called arithmetic circuits, meaning that the inputs to the circuit are elements of some finite field and the circuit's gates perform addition and multiplication of two field elements. Such circuits correspond roughly to straight-line computer programs (no branches, conditional statements, and so on) that are algebraic in nature; that is, their primitive data type is field elements.
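As a toy illustration, here is a tiny arithmetic circuit evaluated over a small prime field in Python (a hand-rolled sketch; real SNARK circuits use much larger fields and dedicated libraries):

p = 97  # illustrative small prime; SNARK fields are far larger in practice

def evaluate(circuit, inputs):
    # Each gate is either an addition or a multiplication of two field
    # elements modulo p; wires carry field elements between gates.
    wires = dict(inputs)
    for out, op, a, b in circuit:
        if op == "add":
            wires[out] = (wires[a] + wires[b]) % p
        elif op == "mul":
            wires[out] = (wires[a] * wires[b]) % p
    return wires

# Straight-line circuit computing out = (x + y) * x.
circuit = [
    ("t1", "add", "x", "y"),
    ("out", "mul", "t1", "x"),
]
print(evaluate(circuit, {"x": 5, "y": 11})["out"])  # (5 + 11) * 5 mod 97 = 80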

Most back ends actually support a generalization of arithmetic circuits, often called Rank-1 Constraint Satisfaction (R1CS) instances. With the notable exception of Groth16 and its predecessors, these SNARKs can also be used to support other IRs. For example, StarkWare uses something called an Algebraic Intermediate Representation (AIR), which is similar to the PlonKish arithmetization supported by PlonK and other back ends. The ability of some back ends to support more general IRs can reduce the overhead of front ends that produce those IRs.

Back ends also vary in the finite fields they natively support. I discuss this further in the next section.

Multiple Approaches to the Front End

Some (very special) computer programs naturally correspond to arithmetic circuits. One example is a program that performs naive multiplication of matrices over some field. But most computer programs are neither straight-line nor algebraic. They typically involve conditional statements, operations such as integer division or floating-point arithmetic that do not naturally correspond to finite field operations, and more. In these cases, the front-end overhead will be substantial.

A popular front-end approach is to produce circuits that essentially execute, step by step, some simple CPU, also called a virtual machine (VM). Front-end designers specify a set of primitive operations, analogous to the assembly instructions of a real computer processor. Developers who want to use the front end either write "witness-checking programs" directly in that assembly language, or write them in a high-level language such as Solidity and have their programs automatically compiled into the assembly code.

For example, StarkWare's Cairo is a very limited assembly language, with assembly instructions that roughly allow addition and multiplication over a finite field, function calls, and reads and writes to an immutable (i.e., write-once) memory. The Cairo VM is a von Neumann architecture, meaning the circuits produced by the front end essentially take a Cairo program as public input and "run" that program on the witness. The Cairo language is Turing complete: despite its limited instruction set, it can simulate more standard architectures, although doing so can be expensive. The Cairo front end turns a Cairo program executing T primitive instructions into what is called a "degree-2 AIR with T rows and about 50 columns." What exactly this means is not important here; for the purposes of SNARK provers, it corresponds to a circuit with between 50 and 100 gates for each of the T steps of the Cairo CPU.

RISC Zero takes an approach similar to Cairo, but with the virtual machine being the so-called RISC-V architecture, an open-source architecture with a rich software ecosystem that is growing in popularity. Since RISC-V is a very simple instruction set, designing an efficient SNARK front end that supports it is probably relatively tractable (at least relative to complicated architectures such as x86 or ARM). As of May, RISC Zero turns programs executing T primitive RISC-V instructions into degree-5 AIRs with 3T rows and 160 columns. This corresponds to a circuit with at least 500 gates per step of the RISC-V CPU. Further improvements are expected in the near future.
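For a rough sense of scale, the per-step gate counts quoted above imply circuit sizes like the following for a program executing a million VM steps (the step count is an arbitrary example; the gate counts are the ballpark figures above, not benchmarks):

T = 1_000_000                      # VM steps executed by the witness-checking program (example)
cairo_low, cairo_high = 50, 100    # gates per Cairo CPU step (quoted range)
riscv = 500                        # at least 500 gates per RISC-V step (RISC Zero, as of May)

print(f"Cairo circuit:  {T * cairo_low:,} to {T * cairo_high:,} gates")
print(f"RISC-V circuit: at least {T * riscv:,} gates")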

The various zkEVM projects (e.g., zkSync 2.0, Scroll, Polygon's zkEVM) take the virtual machine to be the Ethereum Virtual Machine. The process of turning every EVM instruction into an equivalent "gadget" (that is, an optimized circuit implementing that instruction) is much more involved than for the simpler Cairo and RISC-V architectures. For various reasons, some zkEVM projects do not implement the EVM instruction set directly, but instead compile high-level Solidity programs into other assembly languages before turning them into circuits. Performance results for these projects are still pending.

"CPU emulator" projects such as RISC Zero and Cairo produce a single circuit that can handle all programs in the associated assembly language. An alternative, ASIC-like approach produces different circuits for different programs. This ASIC-like approach can yield smaller circuits for some programs, especially when the assembly instruction the program executes at each time step does not depend on the program's input. For example, it can potentially avoid front-end overhead entirely for straight-line programs such as naive matrix multiplication. But the ASIC approach also appears very limited; as far as I know, it is not known how to use it to support loops without predetermined iteration bounds.

The final component of front-end overhead comes from the fact that all SNARKs use circuits that operate over finite fields. The CPU on your laptop can multiply or add two integers with a single machine instruction. If the front end outputs a circuit over a field of large enough characteristic, it can essentially simulate that multiplication or addition with the corresponding field operation. However, implementing the field operation on a real CPU typically requires many machine instructions (although some modern processors do have native support for certain fields).

Some SNARK back ends allow more flexible field choices than others. For example, if the back end makes use of a cryptographic group G, the circuit's field must match the number of elements in G, which can be limiting. In addition, not all fields support practical FFT algorithms.

There is only one implemented SNARK, Brakedown, that natively supports computation over arbitrary (large enough) fields. Along with its descendants, it has the fastest known concrete prover performance even over the fields that other SNARKs support, but its proofs are currently too large for many blockchain applications. Recent work seeks to improve proof size, but the prover is asymptotically slower and there appear to be obstacles to practicality.

Some projects have chosen to work over fields with especially fast arithmetic. For example, Plonky2 and others use a field of characteristic 2^64 − 2^32 + 1, because arithmetic in that field can be implemented several times faster than in less structured fields. However, using such a small characteristic can make it harder to represent integer arithmetic efficiently via field operations. (For example, multiplying a 32-bit unsigned integer by any number larger than 2^32 could yield a result larger than the field characteristic. Hence, the field choice naturally supports arithmetic only on 32-bit numbers.)
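A short illustration of this field and the 32-bit limitation (the modulus is the one stated above; the code is just a demonstration of where integer arithmetic stops matching field arithmetic):

p = 2**64 - 2**32 + 1        # characteristic of the field used by Plonky2 and others

a = 2**32 - 1                # largest 32-bit unsigned integer
assert a * a < p             # products of two 32-bit values never wrap around

b = 2**32 + 1                # just past 32 bits
print(a * b >= p)            # True: this product exceeds the field characteristic
print((a * b) % p)           # the value the circuit would actually see (2^32 - 2)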

In any case, for all of today's popular SNARKs to achieve 128 bits of security (without a significant increase in verification cost), they have to work over a field of size larger than 2^128. As far as I can tell, this means each field operation requires at least about 10 64-bit machine multiplications, plus a considerable number of additions and bitwise operations. Hence, one should factor in at least an order of magnitude of front-end overhead due to the need for circuits that operate over finite fields.
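As a rough sketch of where that factor of roughly 10 comes from: a field element larger than 2^128 spans three or more 64-bit limbs, and schoolbook multiplication of multi-limb numbers already costs limbs-squared machine multiplications before the modular reduction (real implementations use Montgomery arithmetic and other optimizations; this only counts the product-forming multiplies):

def limb_multiplies(field_bits, word_bits=64):
    # Number of word-by-word multiplications to form the full product of two
    # multi-limb integers with schoolbook multiplication (reduction not counted).
    limbs = (field_bits + word_bits - 1) // word_bits
    return limbs * limbs

print(limb_multiplies(192))  # 9  -- a field just above 2^128 needs 3 limbs
print(limb_multiplies(256))  # 16 -- a ~256-bit field, common for 128-bit security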

All told, existing front ends that use a virtual machine abstraction produce circuits with 100 to 1,000 gates per step of the virtual machine, and possibly more for more complicated virtual machines. On top of that, finite field arithmetic is at least 10 times slower than analogous instructions on modern processors (with possible exceptions if the processor has built-in support for certain fields). An "ASIC front-end approach" might reduce some of these overheads, but it is currently limited in the kinds of programs it can support.

What is the backend bottleneck?

SNARKs for circuit satisfiability are typically designed by combining an information-theoretically secure protocol called a "polynomial IOP" with a cryptographic protocol called a "polynomial commitment scheme." In most cases, the concrete bottleneck for the prover is the polynomial commitment scheme. In particular, these SNARKs have the prover cryptographically commit to one or more polynomials whose degree is (at least) |C|, the number of gates in the circuit C.

In turn, the concrete bottleneck within the polynomial commitment scheme depends on the scheme used and the circuit size. But it is always one of three operations: computing FFTs, exponentiations in a cryptographic group, or Merkle hashing. Merkle hashing is typically a bottleneck only for very small circuits, so we will not discuss it further.

Polynomial commitments based on discrete logarithms

In polynomial commitments based on the hardness of the discrete logarithm problem in a cryptographic group G (KZG, Bulletproofs, Dory, and Hyrax-commit), the prover has to compute a Pedersen vector commitment to the polynomial's coefficients. This involves a multi-exponentiation of size equal to the polynomial's degree. In SNARKs, this degree is typically the size |C| of the circuit C.

Naively, a multi-exponentiation of size |C| requires about 1.5·|C|·log|G| ≈ 400·|C| group operations, where |G| denotes the number of elements in the group G. However, there is an algorithm, called Pippenger's algorithm, that can reduce this by a factor of roughly log|C|. For large circuits (say, |C| ≥ 2^25), this log|C| factor can concretely be 25 or more; that is, for large circuits we expect the Pedersen vector commitment to be computable with a bit over 10·|C| group operations. Each group operation, in turn, tends to be (as a very rough ballpark) about 10 times slower than a finite field operation. SNARKs using these polynomial commitments are therefore about as expensive for P as roughly 100·|C| field operations.
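Plugging in numbers, with log|G| ≈ 256 for a group of roughly 2^256 elements (these are the constants quoted above, not benchmarks):

import math

C = 2**25            # circuit size |C|
log_G = 256          # bits in the group order, ~2^256 elements

naive = 1.5 * C * log_G                 # ~ 400 * |C| group operations
pippenger = naive / math.log2(C)        # Pippenger saves roughly a log|C| factor

print(f"naive:     ~{naive / C:.0f} group operations per gate")       # ~384
print(f"Pippenger: ~{pippenger / C:.1f} group operations per gate")   # ~15.4
# With each group operation ~10x a field operation (very rough), the prover
# cost lands at roughly 100 * |C| field operations.
print(f"~{10 * pippenger / C:.0f} field-operation equivalents per gate")  # ~154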

But existing SNARKs have additional overheads beyond this factor of 100. For example:

  • Spartan’s prover uses the Hyrax polynomial commitment and must perform |C|^(1/2) multi-exponentiations, each of size |C|^(1/2), weakening the speedup from Pippenger’s algorithm by roughly a factor of two.
  • In Groth16, P has to work over a pairing-friendly group, whose operations are typically at least twice as slow as those of groups that are not pairing-friendly. P must also perform 3 multi-exponentiations rather than 1. Combined, this results in at least an additional factor-of-6 slowdown relative to the 100·|C| estimate above.
  • Marlin and PlonK also require pairings, and their provers commit to considerably more than 3 polynomials.
  • In any SNARK that uses Bulletproofs (e.g., Halo2, which is roughly PlonK but with Bulletproofs in place of KZG polynomial commitments), the prover must compute logarithmically many multi-exponentiations during the “opening” phase of the commitment scheme, which largely wipes out any Pippenger speedup.

In conclusion, known SNARKs using Pedersen vector commitments appear to have a back-end overhead of at least 200x, and as much as 1,000x or more.

Other Polynomial Commitments

For SNARKs using other polynomial commitments, such as FRI and Ligero-commit, the bottleneck is that the prover must perform large FFTs. For example, in the degree-2 AIRs produced by Cairo (with 51 columns and T rows, one row per step of the Cairo CPU), StarkWare's deployed prover performs at least 2 FFTs per column, each of length between 16·T and 32·T. The constants 16 and 32 depend on internal FRI parameters set by StarkWare and can be reduced, but at the cost of increased verification costs.

Optimistically, an FFT of length 32·T takes about 64·T·log(32T) field multiplications. This means that even for relatively small values of T (say, T ≥ 2^20), the number of field operations per column is at least 64·25·T = 1600·T. So the back-end overhead appears to be at least in the thousands. Moreover, for large FFTs the bottleneck may be memory bandwidth rather than field operations.
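The same arithmetic written out, using the constants quoted above (a ballpark, not a benchmark of StarkWare's prover):

import math

T = 2**20                    # Cairo CPU steps (rows); a relatively small trace
fft_len = 32 * T             # upper end of the quoted FFT length
mults = 2 * fft_len * math.log2(fft_len)   # ~ 64 * T * log(32T) multiplications

print(f"one length-32T FFT: ~{mults / T:.0f} * T field multiplications")    # ~1600 * T
print(f"51 columns, at least 2 FFTs each: ~{51 * 2 * mults / T:,.0f} * T")  # ~163,200 * T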

In some contexts, the back-end overhead of SNARKs that perform large FFTs can be mitigated by a technique called proof aggregation. For rollups, this would mean that P (the rollup service) breaks a large batch of transactions into, say, 10 smaller batches. For each mini-batch i, P produces a SNARK proof π_i of that batch's validity. But P does not post these proofs to Ethereum, as that would increase gas costs nearly tenfold. Instead, the SNARK is applied again, this time to produce a proof π establishing that P knows π_1, …, π_10. That is, the witness P claims to know is the ten proofs π_1, …, π_10, and direct witness checking applies the SNARK verification procedure to each of them. This single proof π is posted to Ethereum.

In polynomial commitments such as FRI and Ligero-commit, there is a strong tension between P's time and V's cost, with internal parameters acting as a knob that trades off between the two. Since π_1, …, π_10 are not posted to Ethereum, the knob can be tuned so that these proofs are large and producing them is fast. Only in the final application of the SNARK, which aggregates π_1, …, π_10 into a single proof π, does the commitment scheme need to be configured to ensure a small proof.
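A schematic sketch of this aggregation flow in Python; snark_prove and snark_verify are placeholders standing in for a real SNARK back end, not an actual library API:

def snark_prove(statement, witness=None):
    # Placeholder "proof" object; in reality this is the expensive step.
    return {"statement": statement, "witness_digest": hash(repr(witness))}

def snark_verify(proof):
    # Placeholder verifier; always accepts in this sketch.
    return True

def aggregate(batches):
    # Step 1: one proof per mini-batch. These are never posted on-chain, so the
    # FRI/Ligero knob can favor fast proving at the cost of large proofs.
    inner_proofs = [snark_prove(("batch_valid", i), witness=b)
                    for i, b in enumerate(batches)]

    # Step 2: a single outer proof that the prover knows pi_1, ..., pi_10.
    # Direct witness checking here means running the SNARK verifier on each
    # inner proof; only this outer proof needs to be configured to be small.
    assert all(snark_verify(p) for p in inner_proofs)
    return snark_prove(("all_inner_proofs_verify",), witness=inner_proofs)

outer = aggregate([["tx1", "tx2"], ["tx3", "tx4"]])  # the only proof posted to Ethereum
print(outer)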

StarkWare plans to deploy proof aggregation imminently. This is also the focus of projects such as Plonky2.

What are the other bottlenecks in SNARK scalability?

This article has focused on prover time, but other prover costs can also be scalability bottlenecks. For example, with many SNARK back ends, the prover must store multiple field elements for each gate in C, and this space cost can be enormous. A program ψ that runs for one second on a laptop can perform roughly a billion primitive operations on a modern processor. In general, C requires more than 100 gates per such operation. That means 100 billion gates, which, depending on the SNARK, could mean tens or hundreds of terabytes of space for P.
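The same space estimate in numbers (illustrative; the number of field elements stored per gate is an assumption, as it varies by SNARK):

ops = 10**9                  # primitive operations psi performs in one second
gates_per_op = 100           # gates in C per primitive operation
elements_per_gate = 8        # assumption: field elements the prover stores per gate
bytes_per_element = 32       # a ~256-bit field element

gates = ops * gates_per_op                                   # 100 billion gates
space_bytes = gates * elements_per_gate * bytes_per_element
print(f"{gates:,} gates, roughly {space_bytes / 1e12:.0f} TB of prover space")  # ~26 TB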

Another example: many popular SNARKs (e.g., PlonK, Marlin, Groth16) require a complicated "trusted setup ceremony" to generate a structured "proving key," which the prover must store. As far as I know, the largest such ceremony generated a proving key capable of supporting circuits with about 2^28 ≈ 250 million gates. The proving key is tens of gigabytes in size.

In contexts where proof aggregation is possible, these bottlenecks can be substantially mitigated.

Looking to the future: the prospects for more scalable SNARKs

Both front-end and back-end overheads can be three orders of magnitude or more. Can we expect these to come down substantially in the near future?

To some extent, I think we can. First, the fastest back ends today (such as Spartan and Brakedown, and other SNARKs that combine the sum-check protocol with polynomial commitments) have large proofs (typically on the order of the square root of the circuit size), so people do not really use them. I expect a significant reduction in proof size and verifier time in the near future via depth-one composition with small-proof SNARKs. Similar to proof aggregation, this means the prover would first generate a proof π with the fast-prover, large-proof SNARK, but would not send π to V. Instead, P would use a small-proof SNARK to produce a proof π' that it knows π, and send π' to V. This could substantially shave down the back-end overheads of SNARKs that are popular today.
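A minimal sketch of that two-layer composition; the two prover functions are hypothetical placeholders, and the point is only the structure of who sees which proof:

def fast_prover_big_proof(statement, witness):
    # Stands in for a sum-check-based SNARK such as Spartan or Brakedown:
    # fast to produce, but the proof is large (roughly sqrt of circuit size).
    return {"kind": "big", "statement": statement}

def slow_prover_small_proof(statement, witness):
    # Stands in for a small-proof SNARK, applied only to the much smaller
    # statement "I know a proof pi that the big-proof SNARK's verifier accepts."
    return {"kind": "small", "statement": statement}

def composed_prove(statement, witness):
    pi = fast_prover_big_proof(statement, witness)   # never sent to V
    pi_prime = slow_prover_small_proof(("knows_accepting_proof_for", statement), pi)
    return pi_prime                                  # sent to V

print(composed_prove("batch_is_valid", witness=["tx1", "tx2"]))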

Second, hardware acceleration can help. A very rough rule of thumb is that GPUs can buy a 10x speedup over CPUs, and ASICs a 10x speedup over GPUs. However, I have three concerns on this front. First, for large FFTs the bottleneck may be memory bandwidth rather than field operations, so SNARKs that perform such FFTs may see limited speedups from specialized hardware. Second, while this article has focused on the polynomial commitment bottleneck, many SNARKs require the prover to perform other operations that are only slightly less expensive than the polynomial commitment. So breaking the polynomial commitment bottleneck (e.g., speeding up multi-exponentiations in discrete-logarithm-based SNARKs) may leave behind a new bottleneck operation that is not much better than the old one. (For example, SNARKs including Groth16, Marlin, and PlonK also have the prover perform FFTs in addition to multi-exponentiations.) Finally, the fields and elliptic curves that lead to the most efficient SNARKs may continue to evolve for some time, which may create challenges for ASIC-based acceleration of the prover.

On the front-end side, we may increasingly find that the "CPU emulator" approach of projects such as Cairo, RISC Zero, and the zkEVMs actually scales reasonably well (in terms of performance) with the complexity of the CPU instruction set. Indeed, this is exactly the hope of the various zkEVM projects. It may mean that, while front-end overhead remains three orders of magnitude or more, front ends can support VMs that increasingly match real CPU architectures. A countervailing concern is that front ends may become complicated and difficult to audit as hand-coded gadgets for increasingly complex instructions proliferate. Formal verification methods are likely to play an important role in addressing this concern.

Finally, at least in blockchain applications, we may find that most smart contracts appearing in the wild mainly use simple, SNARK-friendly instructions. This may keep front-end overheads low in practice, while retaining the generality and improved developer experience that come from supporting high-level programming languages such as Solidity and rich instruction sets including the EVM's.
