Title: TopDomain dataset
Authors: Mulnaes, Daniel
Golchin, Pegah
Koenig, Filip
Gohlke, Holger
Keywords: Protein structure prediction
Boundary prediction
Issue Date: 2021
Publisher: N/A
Abstract: This is the TopDomain dataset as described in "TopDomain: Exhaustive Protein Domain Boundary Meta-Prediction Combining Multi-Source Information and Deep Learning" by Daniel Mulnaes, Pegah Golchin, Filip Koenig, and Holger Gohlke. The dataset contains two folders: training_set, which contains the FASTA files of the TopDomain training set, and test_set, which contains the FASTA files of the TopDomain test set. Each FASTA file has a header with three fields in the format ">system_name|domain_type|boundary_list", where system_name contains the PDB ID and chain ID of the target protein; domain_type contains the target type, either single-domain or multi-domain; and boundary_list contains a space-separated list of the residues annotated as domain boundaries. The boundary_list field is empty for single-domain proteins, as they have no domain boundaries. The header is followed by the protein sequence in FASTA format, with at most 100 residues per line. No protein in the test set shares more than 20% sequence identity with any protein in the training set.
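The header format described in the abstract can be read with a few lines of Python. The following is a minimal sketch, not code shipped with the dataset: the names TopDomainRecord and parse_topdomain_fasta are hypothetical, the example header is invented for illustration, and the sketch assumes one record per file and integer residue numbers in boundary_list.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class TopDomainRecord:
    """One parsed file (hypothetical container, not part of the dataset)."""
    system_name: str       # PDB ID and chain ID of the target protein
    domain_type: str       # target type, kept exactly as written in the header
    boundaries: List[int]  # residues annotated as domain boundaries
    sequence: str          # full sequence with line breaks removed


def parse_topdomain_fasta(lines: Iterable[str]) -> TopDomainRecord:
    """Parse the lines of a single-record FASTA file in the described format.

    Assumption: each file holds exactly one record and boundary residues
    are written as integers.
    """
    header = None
    seq_parts = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                raise ValueError("expected a single record per file")
            header = line[1:]
        else:
            seq_parts.append(line)
    if header is None:
        raise ValueError("no FASTA header found")

    fields = header.split("|")
    if len(fields) != 3:
        raise ValueError(f"expected 3 header fields, got {len(fields)}")
    system_name, domain_type, boundary_list = (f.strip() for f in fields)

    # An empty boundary_list (single-domain proteins) yields an empty list.
    boundaries = [int(token) for token in boundary_list.split()]
    return TopDomainRecord(system_name, domain_type, boundaries, "".join(seq_parts))


# Invented example record, for illustration only.
example = [">1abcA|multi-domain|45 120", "MKVLAA", "GGSTPL"]
record = parse_topdomain_fasta(example)
print(record.system_name, record.domain_type, record.boundaries, len(record.sequence))
```

To read a real file, pass an open file handle, for example parse_topdomain_fasta(open(path)), where path points to a file inside training_set or test_set. The domain_type string is stored unchanged because the exact spelling used in the files is not checked here.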
Appears in Collections: Computational Pharmaceutical Chemistry and Molecular Informatics Group

Files in This Item:
File: topdomain_dataset_1.0.tar.gz
Size: 1.12 MB
Format: Unknown

This item is licensed under a Creative Commons License.