Links to the datasets used in this study:
How datasets were built
1) Dataset of Positives extracted from PDB: Sequences were added to the positive dataset if their structures contain iron-sulfur (Fe-S) clusters ligated to the polypeptide chain. The chemical components used to identify Fe-S clusters are: FES, F4S, SF4, F3S, SF3, FS5, CLF, CLP, 1CL. Sequences that bind an individual iron ion via four cysteine residues were also added to the positive dataset.
2) Dataset of Negatives extracted from PDB: The negative set comprises sequences that do not bind Fe-S clusters. It was obtained in two steps:
- Removing from the PDB dataset all sequences with a sequence identity greater than 30% to iron-sulfur proteins.
- Extracting from the remaining ensemble the structures that either bind a transition metal ion, or are not included in MetalPDB (i.e. have a sequence identity lower than 30% to any structure contained in MetalPDB) but have at least four cysteines in the sequence. The four-cysteine requirement was chosen because Fe-S binding sites are most commonly formed by four cysteines. The table below lists all the datasets that were merged into the negative dataset.
3) Escherichia coli proteome: In this proteome, each FASTA sequence is tagged with "pos_" if the protein has been reported to be an Fe-S protein (see Table S2 in the supplementary material) or with "neg_" if there is no published evidence that it is an Fe-S protein.
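
The selection criteria and the labelling scheme described above can be sketched in a few lines of Python. This is only an illustration, not the code used to build the datasets: the function names are hypothetical, and the sketch assumes that the "pos_" / "neg_" tag sits at the very start of each FASTA header, right after the ">" character, which the description above does not state explicitly.

```python
# Chemical component IDs listed above for Fe-S clusters.
FES_COMPONENT_IDS = {"FES", "F4S", "SF4", "F3S", "SF3", "FS5", "CLF", "CLP", "1CL"}


def has_fes_cluster(ligand_ids):
    """Return True if any ligand ID is one of the listed Fe-S components.

    Note: single iron ions bound by four cysteines (also counted as
    positives above) are NOT detected by this simple ID check.
    """
    return any(lig.upper() in FES_COMPONENT_IDS for lig in ligand_ids)


def has_min_cysteines(sequence, minimum=4):
    """Return True if the sequence has at least `minimum` cysteines (C)."""
    return sequence.upper().count("C") >= minimum


def _make_record(header, chunks):
    # Assumption: the tag is a prefix of the header ("pos_..." or "neg_...").
    label = None
    for tag in ("pos", "neg"):
        if header.startswith(tag + "_"):
            label = tag
    return label, header, "".join(chunks)


def read_labeled_fasta(lines):
    """Yield (label, header, sequence) for each record in FASTA-formatted lines.

    `label` is "pos", "neg", or None when the header carries neither tag.
    """
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield _make_record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield _make_record(header, chunks)


if __name__ == "__main__":
    # Tiny made-up example; the identifiers and sequences are placeholders.
    example = [">pos_example1", "MCAACTTC", "GGC", ">neg_example2", "MKTLLV"]
    for label, header, seq in read_labeled_fasta(example):
        print(label, header, len(seq), has_min_cysteines(seq))
```

To read a real file, an open file handle can be passed directly, for example `read_labeled_fasta(open(path))`, where `path` points to the downloaded proteome file.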
Developed at CERM - University of Florence