PaddleDTX is a distributed machine learning technology solution based on distributed storage. It solves the problem of securely storing and exchanging massive amounts of private data, and helps all parties break through data silos to maximize the value of their data.
The computing layer of PaddleDTX is a network consisting of three types of nodes: Requester, Executor and DataOwner. Training samples and prediction datasets are stored in a decentralized storage network consisting of DataOwner and Storage nodes. Both the decentralized storage network and the computing layer are backed by an underlying blockchain network.
Multi-Party Computation Network
The Requester is the party that needs predictions, and the Executor is the party authorized by the DataOwner to access the sample data for model training and result prediction. Multiple Executor nodes form an SMPC (Secure Multi-Party Computation) network. The Requester node publishes a task to the blockchain network, and the Executor nodes execute it once authorized. An Executor node obtains sample data through the DataOwner, which endorses the trustworthiness of the data.
An SMPC network is a framework that supports multiple distributed learning processes running in parallel. In the future, vertical federated learning and horizontal federated learning algorithms will be supported.
Decentralized Storage Network
A DataOwner node processes its own private data, using encryption, segmentation and replication related algorithms in the process, and finally distributes the encrypted shards to multiple Storage nodes. A Storage node proves that it is honestly holding a piece of data by answering a challenge generated by the DataOwner. Through these mechanisms, storage resources can be securely maintained without violating any data privacy.
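The challenge-answer exchange can be illustrated with a minimal hash-based possession check. This is a deliberate simplification with made-up data; PaddleDTX's actual proof-of-data-possession scheme is more involved, but the core idea is the same: only a node holding the full shard can compute the correct response to a fresh challenge.

```python
import hashlib
import secrets

def answer_challenge(shard: bytes, nonce: bytes) -> bytes:
    # Storage node side: must hold the complete shard bytes
    # to produce this digest for an unpredictable nonce.
    return hashlib.sha256(nonce + shard).digest()

# DataOwner side: issues a fresh random nonce per challenge,
# and verifies against the digest it expects for that shard.
shard = b"encrypted-shard-bytes"          # hypothetical shard content
nonce = secrets.token_bytes(16)           # fresh challenge, prevents replay
expected = hashlib.sha256(nonce + shard).digest()

assert answer_challenge(shard, nonce) == expected
```

Because the nonce changes on every challenge, a Storage node cannot precompute or cache answers; it must keep the shard available to respond correctly.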
Training tasks and prediction tasks will be broadcast to Executor nodes through the blockchain network. The Executor nodes involved will then execute these tasks. DataOwner nodes and Storage nodes exchange information through the blockchain network while monitoring the health status of files and nodes, as well as during the challenge-answer-verify process of proof-of-replica ownership.
Currently, XuperChain is the only blockchain framework supported by PaddleDTX.
Vertical Federated Learning
The open source version of PaddleDTX supports Vertical Federated Learning (VFL) algorithms, including two-party linear regression, two-party logistic regression, and three-party DNNs (deep neural networks). The implementation of DNN relies on the PaddleFL framework, and all neural network models provided by PaddleFL can be used in PaddleDTX. In the future, more algorithms will be open sourced, including multi-party VFL and multi-party HFL (horizontal federated learning) algorithms.
How it works
The training and prediction steps are as follows:
FL tasks need to specify sample files that will be used for computation or prediction, which are stored in a decentralized storage system (XuperDB). Before executing the task, the executor (usually the data owner) needs to get its own sample file from XuperDB.
Both VFL training and prediction tasks require a sample alignment process: the participants' sample ID lists are used to find the intersection of their samples, and training and prediction are then performed on the intersecting samples only. The project implements PSI (Private Set Intersection) for sample alignment, so that no participant's non-intersecting IDs are revealed.
Model training is an iterative process that relies on cooperative computation over the two parties' samples. Participants exchange intermediate parameters over many training epochs in order to obtain an appropriate local model for each party.
To ensure the confidentiality of each participant's data, the Paillier cryptosystem is used for parameter encryption and decryption. Paillier is an additively homomorphic algorithm, which allows addition and scalar multiplication to be performed directly on ciphertexts.
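The additive homomorphism can be demonstrated with a toy Paillier implementation. The primes below are tiny and chosen only for illustration (real deployments use keys of 1024 bits or more), but the algebra is the standard scheme: multiplying two ciphertexts adds the underlying plaintexts, and raising a ciphertext to a power scales the plaintext.

```python
from math import gcd

def lcm(a, b):
    return a * b // gcd(a, b)

def keygen(p, q):
    n = p * q
    g = n + 1                                # standard generator choice
    lam = lcm(p - 1, q - 1)
    # mu = L(g^lam mod n^2)^-1 mod n, where L(x) = (x - 1) // n
    mu = pow((pow(g, lam, n * n) - 1) // n, -1, n)
    return (n, g), (lam, mu)

def encrypt(pk, m, r):
    n, g = pk
    # r must be coprime to n; it randomizes the ciphertext
    return (pow(g, m, n * n) * pow(r, n, n * n)) % (n * n)

def decrypt(pk, sk, c):
    n, _ = pk
    lam, mu = sk
    return ((pow(c, lam, n * n) - 1) // n) * mu % n

pk, sk = keygen(1789, 1861)                  # toy primes, illustration only
c1 = encrypt(pk, 42, 123)
c2 = encrypt(pk, 58, 456)

# Homomorphic addition: ciphertext product decrypts to the plaintext sum.
c_sum = (c1 * c2) % (pk[0] ** 2)
assert decrypt(pk, sk, c_sum) == 100

# Scalar multiplication: ciphertext exponentiation scales the plaintext.
c_scaled = pow(c1, 3, pk[0] ** 2)
assert decrypt(pk, sk, c_scaled) == 126
```

This is exactly the property federated training needs: parties can aggregate each other's encrypted gradients or intermediate parameters without ever seeing the plaintext values.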
The prediction task requires a model, so the related training task needs to be completed before the prediction task starts. Models are stored separately in each executor's local storage. Participants use their own models to calculate partial predictions, which are then collected to derive the final result.
For linear regression, a denormalization process can be performed after all partial results have been collected. This process can only be done by the party holding the labels, so all partial results are sent to that party, which derives the final result and stores it as a file in XuperDB for the Requester to use.
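The partial-prediction and denormalization flow can be sketched as follows. All feature values, weight slices, and normalization statistics here are made-up numbers for illustration: each party computes a local dot product over only its own feature columns, and the label holder sums the parts and un-scales the result using statistics that only it knows.

```python
def partial_prediction(features, weights, bias=0.0):
    # local dot product over this party's feature columns only
    return sum(f * w for f, w in zip(features, weights)) + bias

x_a, w_a = [1.0, 2.0], [0.5, 1.5]          # party A's features and model slice
x_b, w_b = [3.0], [2.0]                     # party B holds the label and bias

part_a = partial_prediction(x_a, w_a)               # sent to the label holder
part_b = partial_prediction(x_b, w_b, bias=0.25)    # computed locally by B
y_norm = part_a + part_b                            # combined normalized output

# Denormalization is done only by the label holder, since only it
# knows the label mean and standard deviation used during training.
label_mean, label_std = 100.0, 20.0
y = y_norm * label_std + label_mean
assert y == 295.0
```

Note that party A never sees the label statistics or the final prediction unless the label holder chooses to publish it, which matches the flow described above where the result is written to XuperDB for the Requester.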