About CODEPLOT
CODEPLOT is committed to providing users with a reliable and flexible computing platform. You can carry out automated bioinformatics analyses without a programming background. We also ensure data security by using blockchain, secure multi-party computation, and other cutting-edge technologies. CODEPLOT continuously collates and integrates data, providing comprehensive data resources for researchers all over the world and making analysis more convenient.
Advantages of CODEPLOT
(1) CODEPLOT computing environments
CODEPLOT is built on multi-level authorization control and uses storage encryption, transmission encryption, blockchain, and secure multi-party computation to keep data available but invisible. The entire data and analysis lifecycle is certified on the blockchain, providing a trusted computing environment for platform users.
(2) Multi-omics computing resources
The CODEPLOT sharing system provides a large number of commonly used datasets and bioinformatics analysis workflows. Users can also create personalized datasets and carry out collaborative analysis with researchers or team members from different regions through a shared workspace.
(3) Elastic computing component
Users can employ flexible, composable computing components to manage their own tools whenever and wherever they like.
Contact us
CODEPLOT was designed and developed by the China National GeneBank DataBase (CNGBdb). To report errors, request features, or provide feedback, please contact us. Email: CNGBdb@cngb.org
Function introduction
This platform consists of five functional modules: my workspace, datasets, tools, blockchain, and help.
1. My workspace
CODEPLOT uses the workspace to build a computing sandbox and manages public and restricted datasets at the workspace level. Users can create personalized datasets in a workspace, or carry out collaborative analysis with researchers or team members from different regions through a shared workspace.
Workspace components include the description, data, workflows, and operation results.
1.1 Workspace description
The description supports online Markdown editing and preview.
You can describe the research project, including what questions your project is trying to answer, what kind of data and analysis will be used, etc. Documentation is important! Good descriptions make workspaces easier to share and collaborate on. The description page also includes information about the workspace owner and the creation date.
1.2 Dataset
How do you organize and use data resources? In the data tab, the left function bar contains three sections: meta information, reference genome, and files.
- Meta information: the meta information table is a built-in, Excel-like form that helps users access and use public datasets or personal data intuitively. Data is stored securely in the user's object store. The form connects the workspace metadata table to the data through metadata links, and each link points to the actual storage location of the data file on the cloud.
- Reference genome: users can add the platform's curated reference genome files for use in analysis.
- Files: dataset files, calculation results, and my uploads.
  Dataset files: public dataset files imported into the workspace.
  Calculation results: files generated through analysis in this workspace.
  My uploads: files uploaded to the workspace; these can only be accessed by you.
1.3 Workflows
Workflows are the platform's bioinformatics analysis processes, written in the WDL (Workflow Description Language). On the workflows page, you can find bioinformatics workflows for batch analysis of the corresponding datasets.
Click the workflows tab and select a workflow from the list, such as similarity analysis for biological sequences.
1.4 Operation results
On the operation results pages, users can view the list and status of all analysis tasks submitted in the corresponding workspace, the historical parameters of each run, and the generated result files, which ensures the reproducibility and provenance of tasks.
Click the "operation results" tab to view the status of batch tasks in the task list.
Click a batch task ID to view the status of its subtasks.
2. Datasets
CODEPLOT collates and integrates data, collecting datasets from multi-omics and multi-species research. It aims to provide comprehensive data and information resources for researchers all over the world, so that researchers can easily carry out computation, analysis, and mining, and to promote the reuse of data.
The current version integrates the following datasets:
2.1 Assembly and gene annotation of the 1000 plant transcriptomes
The 1000 Plants (1KP) project is a large-scale, international, multidisciplinary consortium that acquired transcriptome data from over 1,000 plant species.
2.2 COVID-19 database
The COVID-19 Novel Coronavirus Sequence database integrates released coronavirus sequence data from CNGB, NCBI, GISAID, and PDB. It provides an effective reference for studying and analyzing the evolutionary origin and pathological mechanism of COVID-19.
2.3 Single-cell Database
The single-cell database shares and integrates complex single-cell datasets and provides single-cell analysis tools and visualization services, making it easy for researchers to access and explore published single-cell datasets.
Users can clone the workspace of a public dataset to construct a workspace corresponding to that resource.
3. Tools
CODEPLOT builds tools for different research directions based on existing dataset resources. In the future, CODEPLOT will support user-deployed tools for personalized analysis, making it more convenient for users to analyze and use data. To support privacy-preserving computing scenarios, the system also provides secure multi-party computation tools that keep user data available but invisible.
The platform includes the following tools:
3.1 BLAST searches on COVID-19 sequences
The BLAST database is constructed from CNGB, GenBank, and GISAID data. Sequences similar to the novel coronavirus can be retrieved quickly, providing an effective reference for analyzing the evolutionary origin of COVID-19.
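As a minimal sketch of the kind of search this tool performs, the snippet below runs blastn from the NCBI BLAST+ suite against a local nucleotide database; the file and database names are hypothetical, and the platform's hosted search accepts sequences through its web interface instead.

```python
import subprocess

# Hypothetical paths; the hosted tool manages its own database behind the web form.
query = "my_sequence.fasta"   # query sequence(s) in FASTA format
database = "covid19_nt"       # pre-built BLAST nucleotide database
output = "blast_hits.tsv"

# blastn is part of NCBI BLAST+; -outfmt 6 produces tab-separated hit rows.
subprocess.run(
    ["blastn",
     "-query", query,
     "-db", database,
     "-out", output,
     "-outfmt", "6",       # tabular: qseqid sseqid pident length ... evalue bitscore
     "-evalue", "1e-5",    # only report hits below this E-value
     "-num_threads", "4"],
    check=True,
)
```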
3.2 Single-cell cluster analysis tool: scanpy
With the rapid development of next-generation single-cell sequencing technology, accuracy has greatly improved and sequencing costs have fallen further, producing huge amounts of molecular biology data. Scanpy is a mainstream Python package for analyzing single-cell data, covering preprocessing, dimensionality reduction, clustering, and other steps. To make single-cell analysis convenient, we provide a complete pipeline from reading the matrix to clustering, and we also split out each necessary step so that you can debug step by step.
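The sketch below shows a typical matrix-to-clusters Scanpy pipeline of the kind this tool wraps; the input file name is hypothetical, and the deployed workflow's exact steps and defaults may differ.

```python
import scanpy as sc

# Hypothetical input; the workspace workflow supplies its own matrix file.
adata = sc.read_h5ad("counts.h5ad")

# Preprocessing: basic filtering, normalization, and log transform.
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

# Feature selection and dimensionality reduction.
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
adata = adata[:, adata.var.highly_variable]
sc.pp.scale(adata, max_value=10)
sc.tl.pca(adata, n_comps=50)

# Neighborhood graph, graph-based clustering, and a 2-D embedding.
sc.pp.neighbors(adata, n_neighbors=15, n_pcs=50)
sc.tl.leiden(adata)
sc.tl.umap(adata)
adata.write("clustered.h5ad")
```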
3.3 HMMER gene family identification
HMMER is widely used to search for homologous protein or nucleotide sequences in databases. Based on a profile built from a multiple sequence alignment, HMMER uses hidden Markov models to identify homologous genes. Its main uses include searching with a single protein sequence, with a multiple protein sequence alignment, or with a hidden Markov model against a target sequence database. Here, HMMER is deployed to search for all members of a given gene family in the gene-coding dataset generated by the 1000 plant transcriptomes project.
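A minimal sketch of that build-then-search pattern with the HMMER command-line tools, driven from Python; the alignment, profile, and database file names are hypothetical stand-ins for the files the deployed workflow manages.

```python
import subprocess

# Hypothetical file names; the deployed workflow supplies its own inputs.
msa = "gene_family.sto"         # multiple sequence alignment of known family members
hmm = "gene_family.hmm"         # profile HMM to be built from the alignment
seqdb = "onekp_proteins.fasta"  # gene-coding sequences from the 1KP dataset

# Build a profile HMM from the alignment, then search the sequence database.
subprocess.run(["hmmbuild", hmm, msa], check=True)
subprocess.run(
    ["hmmsearch",
     "--tblout", "family_hits.tbl",  # per-sequence hit table
     "-E", "1e-5",                   # E-value reporting threshold
     "--cpu", "4",
     hmm, seqdb],
    check=True,
)
```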
3.4 Transcriptome differential expression analysis using edgeR
The edgeR package is mainly used to identify differentially expressed genes or markers from read counts produced by different technology platforms (including RNA-seq, SAGE, ChIP-seq, etc.). It uses exact statistical models for multi-group experiments, or generalized linear models suitable for complex multi-factor experiments, and is commonly used for transcriptome differential expression analysis. With the adapted tool, users can provide quantified expression files and comparison group information to quickly obtain the differential genes between comparison groups. In the future, we will continue to enrich the whole-transcriptome pipeline.
4. Blockchain
The platform uses blockchain to record user data files and computation records, ensuring that all relevant computation processes and history can be traced back to the initial data and that the records cannot be tampered with. On the blockchain page, users can view the platform's overall data and computation certification records, and can also make queries through the storage address of a personal data file.
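To illustrate why chained records are tamper-evident, here is a toy hash-chain sketch in Python; it shows the general idea only and is not the platform's actual ledger implementation.

```python
import hashlib
import json

def record_hash(record: dict, prev_hash: str) -> str:
    """Hash a record together with the previous record's hash."""
    payload = json.dumps(record, sort_keys=True) + prev_hash
    return hashlib.sha256(payload.encode()).hexdigest()

# A tiny chain of computation records: each entry commits to everything before it.
chain = []
prev = "0" * 64  # genesis value
for rec in [{"file": "input.fastq", "action": "upload"},
            {"file": "result.vcf", "action": "compute"}]:
    h = record_hash(rec, prev)
    chain.append({"record": rec, "hash": h})
    prev = h

# Tampering with an earlier record changes its hash, breaking every later link.
chain[0]["record"]["file"] = "forged.fastq"
print(record_hash(chain[0]["record"], "0" * 64) == chain[0]["hash"])  # False
```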
Data management
1. Using the data table
CODEPLOT uses data tables to manage your data. In the metadata tab of the workspace data page, a built-in, Excel-like table lets you conveniently reference or organize data attributes from different data sources, including the output files of analyses. You can use data tables to store lists of data files, variable names, participant names, phenotypic data, or any other information you keep in a table. CODEPLOT allows you to fill in or change table elements directly in the interface, as well as add new metadata tables by uploading tab-delimited files with a .tsv suffix.
1.1 Data sources
Data in the workspace is stored in object storage buckets on the cloud. CODEPLOT isolates each user's data, storing it in the user's personal bucket. Users can fill in the complete object-storage path of a data file in a table element, linking and managing existing bucket files through the data table, which makes it convenient to feed the workspace data table into a workflow as input through the workflow configuration.
A table consists of:
1) A header row;
2) Property or metadata rows, where each row corresponds to a different entity (for example: a sample, a length, or a file).
Note: the data table needs at least an ID column. You can include other columns (for example, associated phenotypic information), and the data table organizes the information as a table. You can also configure workflow parameters to write links to output files back into the workspace table, which is useful for downstream analysis.
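For example, a minimal metadata table could be prepared as a tab-delimited file like this; the column names and obs:// paths are hypothetical, and pandas is used here only to show the structure:

```python
import pandas as pd

# Hypothetical metadata table: an ID column plus data-file links and a phenotype column.
table = pd.DataFrame({
    "sample_id": ["S001", "S002"],
    "fastq": ["obs://my-bucket/data/S001.fastq.gz",   # full object-storage path
              "obs://my-bucket/data/S002.fastq.gz"],
    "phenotype": ["control", "case"],
})

# Save as a tab-delimited .tsv file, ready to upload as a new metadata table.
table.to_csv("samples.tsv", sep="\t", index=False)
```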
1.2 Function of Data table
1) Organize a large number of samples
The data table can contain all the information needed, including intermediate output data and the relationships between data items. Organizing a complex study through tables is more efficient; for example, when one participant has many samples, or when a study includes many patients.
2) Batch analysis
In subsequent workflow analysis, users can select multiple rows of data and use 'this.' + a table column name on the parameter page to easily build batch analyses in which different rows supply different parameter values, which makes personalized analysis convenient. The expansion is sketched below.
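Conceptually, a 'this.' expression is evaluated once per selected row; the plain-Python sketch below mimics that fan-out (the table values and parameter name are hypothetical, and the platform performs the real expansion internally):

```python
# Selected rows from a workspace metadata table (hypothetical values).
rows = [
    {"sequence_id": "seq_001", "fastq": "obs://my-bucket/seq_001.fastq.gz"},
    {"sequence_id": "seq_002", "fastq": "obs://my-bucket/seq_002.fastq.gz"},
]

# A parameter configured as 'this.fastq' resolves to that column, row by row,
# so one submission fans out into one task per selected row.
batch_inputs = [{"workflow.input_file": row["fastq"]} for row in rows]
for params in batch_inputs:
    print(params)  # each dict would parameterize one subtask
```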
1.3 Modify table elements
If you only need to change a few entries, you can edit them directly in the interface on the "meta data" tab. For example, if the workspace already contains a data table (with at least one row of data), you can edit a single cell by clicking the pencil icon in the cell you want to change.
1.4 Overwrite or add new tables
If you need to overwrite an existing table or add a new one, click the blue "+" at the top right of "meta data" and follow the instructions to upload the table. Note that when uploading a new table, please choose a name different from any existing table. If it has the same name as an existing table, you will be prompted to overwrite the original table; please proceed with caution.
2. Data upload
Two methods are supported for uploading data/files to your own workspace: uploading in a web browser, or using the object storage file transfer client OBS Browser+. The data upload entry can be found in an existing or newly created workspace as follows: click "Data -> Files -> My upload -> Upload".
2.1 Upload data in web browsers
Select the "browser upload" option in the upload file interface, and upload files by dragging local files into the upload box or clicking the cloud icon to select local files. At most five files can be uploaded at a time, with a size limit of 100 MB per file.
2.2 Upload data through OBS Browser+
The OBS Browser+ client must be downloaded and installed before local files can be transferred to the target path in object storage, using the access key ID and secret access key given in the "transfer tool upload" interface. (A scripted alternative is sketched after the notes below.)
- Please select the AK login method with the given authorization information. User guidance and instructions can be found on the support web page.
- Please keep your AK/SK information secure to avoid data loss.
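Huawei Cloud OBS also exposes an S3-compatible API, so uploads can usually be scripted as well. The boto3 sketch below assumes such an endpoint and uses placeholder endpoint, bucket, key, and credential values; confirm the actual endpoint and target path from the platform's upload interface before relying on it.

```python
import boto3

# Placeholders: use the AK/SK and target path shown in the "transfer tool upload"
# interface, and the object-storage endpoint given in the platform instructions.
s3 = boto3.client(
    "s3",
    endpoint_url="https://obs.example-region.example.com",  # assumed endpoint
    aws_access_key_id="YOUR_ACCESS_KEY_ID",
    aws_secret_access_key="YOUR_SECRET_ACCESS_KEY",
)

# Upload a local file to the target path (bucket/key) in your personal bucket.
s3.upload_file("local_data.fastq.gz", "your-bucket", "my-upload/local_data.fastq.gz")
```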
Run jobs
1. Create / clone workspace
All data analysis on this platform is conducted at the workspace level; a workspace comprises datasets, workflows, and a job monitor. If you are about to start research, you need to create a workspace first. There are two ways to create one:
1.1 Create a workspace
Click the "create workspace" button on the "my workspace" page to start creating a new workspace. A workspace title and summary are required in the creation form; click the submit button to complete the creation.
1.2 Clone a workspace
A workspace associated with a dataset can be cloned by selecting the dataset on the dataset page and then clicking the "clone" button at the bottom right.
2. Configure parameters
Input files and parameters can be configured in the configuration form, which consists of four columns: task name, attribute name, type, and value.
2.1 Select the data from the meta data table in the workspace
If you want to use the rows of data that you selected earlier on the data selection page, you can use 'this.' followed by a column name from the table, such as 'this.sequence_id', which takes the value of the "sequence_id" column of the workspace table as the parameter value for each selected row. Note: please make sure that the value type of the selected column matches the parameter type, to avoid errors during the run.
2.2 Configure files from the object storage bucket
For a parameter of file type, a folder button appears on the left side of the value box.
Click the folder button and a file selection dialog will pop up. You can choose a public dataset file referenced by the workspace, or a file you uploaded earlier, as the input file. Note that if a fixed file is selected in this way, and 'this.' plus a column name is not used in the parameter list, the same file will be run multiple times in a batch.
2.3 Configure output parameters
The parameter value on the output parameter configuration page corresponds to a result file of the WDL workflow. If you want to write an output file back to a column of the previously selected data, fill in, for example, "this.outfile"; the path of the result file is then written back into the outfile column of the table. Note that if the column already exists, its values will be overwritten; if it does not exist, the column will be added. If you do not want to fill in the form, you can keep the default and proceed to the next step.
2.4 Submit jobs
Click the "Run" button on the page of configuring the output parameters, and the running page will prompt you that the task has been submitted, as shown in the figure below. You can click to view the running result to enter the running result page. You can view task running status and task prompts.
3. Task monitoring and debug
The calculation results page contains all batch run records in the workspace. Each record includes the task ID, workflow name, number of batch tasks, batch task status, creation time, and other information.
3.1 Task status
Workflows have the following states: waiting, running, success, and error.
- Waiting: the task has been submitted and is waiting in the queue.
- Running: the task is running.
- Success: the task has completed successfully.
- Error: the task failed while running.
3.2 Subtask operation information
The subtask details page shows the creation time, elapsed time, and status of the subtask, together with the input and output parameters given at submission.
3.3 Task debugging
When a task fails, the platform provides two kinds of logs.
(1) Task log information:
When task submission or execution fails, the subtask status returns a brief error log. Move the mouse over the "error" text in the status column to display the task's error log information.
(2) Tool program log:
Because the task log information is generally short, it sometimes cannot meet your needs. You can view the detailed logs in the "Calculation Results" folder under "Files" in the "Data" tab of the workspace. The directory structure is: batch task ID/subtask ID/task name/execution/. The log files of each step of the tool, stdout and stderr, are generated in the corresponding directory; if a file does not exist, it was not generated. The resulting layout is sketched below.
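A sketch of that layout, with hypothetical IDs and task names:

```
Calculation Results/
└── 12345/                  # batch task ID
    └── 67890/              # subtask ID
        └── align_reads/    # task name
            └── execution/
                ├── stdout  # tool standard output (if generated)
                └── stderr  # tool error output (if generated)
```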
Run Jupyter
Jupyter Notebook is a web-based interactive computing environment, and you can run notebook analyses in CODEPLOT.
1. Enter the Notebook page in a workspace
Create a new workspace or open an existing one; there is a "Notebook" tab on the workspace page.
2. Upload a Notebook file
Click the "+" button to upload or add a Notebook file.
3. Select a cloud environment
Click the "run" button and select a cloud environment to run the Notebook server. Click the "confirm" button, and the notebook analysis will start after a moment.
4. Run successfully
Run interactive analysis on the JupyterLab page. If you are not familiar with JupyterLab, please refer to the documentation.
5. Restrictions
- 50 hours of runtime per month, with a single run limited to 2 hours.
- Only **2-core, 4 GB** resources and the assigned Docker image environments are supported; if you have specific requirements, please contact us.
Blockchain explorer
The input data files uploaded by users to CODEPLOT for computation, and the result files produced after successful execution, are recorded on chain. Users can query the records on the blockchain with the certificate address.
1. Query your task in blockchain explorer
Enter My Workspace -
Note: when the address icon is gray and cannot be clicked, the file has not been used for computation and thus has not been recorded on chain. This applies to all of the following queries.
2. Query of computation results and uploaded documents in blockchain explorer
Enter My Workspace -
3. View the list of all record addresses
Visit the "Blockchain Explorer" page and browse all the certificate addresses at the bottom half of the page. Click on the address you would like to query, and the corresponding record will be highlighted in the explorer.
Secure environment
1. Container Computing Environment
The platform's computing is built with container technology, which offers container-level startup speed together with virtual machine-level security isolation. It has the following characteristics:
· Native support for Kata Containers
· Kata-based kernel virtualization provides comprehensive security isolation and protection
· High-performance, secure containers achieved through hardware virtualization acceleration
2. Secure multi-party computation
Under the existing health and medical data management model, differences between information systems and conflicts of interest usually make it difficult for different institutions to share data efficiently and safely.
The platform uses cutting-edge secure multi-party computation techniques, which allow computation without revealing each party's input data, making the data available but invisible. It features:
1) Input privacy: each participant's input is strictly protected to ensure data independence and data privacy (a toy sketch of this idea follows the list).
2) Decentralization: the algorithm performs decentralized processing to ensure that all parties have equal standing and no party is privileged.
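As a toy illustration of the input-privacy idea, here is an additive secret-sharing sketch, one classic building block of secure multi-party computation; it is illustrative only and not the platform's actual protocol.

```python
import random

MOD = 2**61 - 1  # arithmetic is done modulo a large prime

def share(secret: int, n_parties: int = 3):
    """Split a secret into n random shares that sum to it modulo MOD."""
    shares = [random.randrange(MOD) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % MOD)
    return shares

# Two hospitals each share a private patient count; no single share reveals it.
shares_a = share(1200)
shares_b = share(3400)

# Each party adds the shares it holds; recombining yields only the joint sum.
sum_shares = [(a + b) % MOD for a, b in zip(shares_a, shares_b)]
print(sum(sum_shares) % MOD)  # 4600: the total, with neither input revealed
```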
3. Data Security
Platform data is stored in stable and secure object storage. Object storage (OBS) supports the HTTPS/SSL security protocols, and all user data is encrypted at rest. At the same time, user identity authentication through access keys (AK/SK), IAM permissions, bucket policies, ACLs, hotlink protection, and other techniques protect the security of data transmission and access. With a five-level reliability architecture, data durability reaches 99.999999999% and service continuity reaches 99.99%, both much higher than in traditional architectures.