TopDomain dataset v2.0
Date
2021-05-01
Publisher
N/A
Abstract
This is the TopDomain dataset v2.0, as described in "TopDomain: Exhaustive Protein Domain Boundary Meta-Prediction Combining Multi-Source Information and Deep Learning" by Daniel Mulnaes, Pegah Golchin, Filip Koenig, and Holger Gohlke. The dataset contains three folders:
dataset : Contains the full dataset and the TopDomain and TopDomainSeq predictions for the dataset
training_set : Contains the fasta files of the TopDomain training set
test_set : Contains the fasta files of the TopDomain test set
Each fasta file has a header with three fields, in the following format:
>system_name|domain_type|boundary_list
Where:
system_name contains the PDB ID and chain ID of the target protein
domain_type contains the target type, either single-domain or multi-domain
boundary_list contains a list of residues annotated as domain boundaries,
separated by spaces. This field is empty for single-domain
proteins, as they have no domain boundaries
The sequence is the FASTA sequence of the protein;
each line contains at most 100 residues of the protein sequence
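The header format described above can be read with a few lines of Python. The following is a minimal sketch, not part of the dataset itself: the names `TopDomainRecord` and `parse_topdomain_fasta` are hypothetical, and the code assumes one record per file and that every entry in boundary_list is a plain integer residue number.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TopDomainRecord:
    """One parsed fasta file (hypothetical container, not part of the dataset)."""
    system_name: str       # PDB ID and chain ID, exactly as written in the header
    domain_type: str       # target type, e.g. single-domain or multi-domain
    boundaries: List[int]  # residues annotated as domain boundaries
    sequence: str          # protein sequence with line breaks removed


def parse_topdomain_fasta(text: str) -> TopDomainRecord:
    """Parse the text of a single-record fasta file with a three-field header."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("expected a header line starting with '>'")

    # Header layout: >system_name|domain_type|boundary_list
    fields = lines[0][1:].split("|")
    if len(fields) != 3:
        raise ValueError(f"expected 3 header fields, found {len(fields)}")
    system_name, domain_type, boundary_list = (f.strip() for f in fields)

    # Assumption: boundaries are integers separated by spaces.
    # An empty field (single-domain proteins) yields an empty list.
    boundaries = [int(token) for token in boundary_list.split()]

    # Sequence lines hold at most 100 residues each; join them back together.
    sequence = "".join(lines[1:])
    return TopDomainRecord(system_name, domain_type, boundaries, sequence)
```

A file from training_set or test_set could then be loaded with `parse_topdomain_fasta(open(path).read())`, where `path` points at one fasta file. If the boundary annotations turn out to contain non-integer residue labels, the `int(token)` conversion would need to be relaxed to keep the tokens as strings.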
No protein in the test set shares more than 20% sequence identity to any
protein in the training set.
Keywords
Protein structure prediction, Boundary prediction