Dataset downloads:

Links to the datasets used in this study:

| Description | Redundant dataset (<100% id) | Non-redundant dataset (<25% id) |
| --- | --- | --- |
| Dataset of POSITIVES extracted from PDB (i.e. Fe-S binding structures) | download (2390 sequences) | download (163 sequences) |
| Dataset of NEGATIVES extracted from PDB (see below for construction) | - | download (2607 sequences) |
| Escherichia coli proteome (sequences are marked with "pos_" if they bind Fe-S or with "neg_" if not) | download (4431 sequences: 149 positives, 4282 negatives) | |


How datasets were built

1) Dataset of Positives extracted from PDB: Sequences were added to the positive dataset if their structures contain iron-sulfur (Fe-S) clusters ligated to the polypeptide chain. The chemical components used to identify Fe-S clusters are: FES, F4S, SF4, F3S, SF3, FS5, CLF, CLP, 1CL. Sequences that bind individual iron ions via four cysteine residues were also added to the positive dataset.
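As an illustration of the component-based selection described above, the following minimal sketch scans the HETATM records of a legacy PDB-format file for the listed chemical component IDs. The function name fes_chains and the sample records are hypothetical and are not part of the original pipeline. The sketch only reports which chain IDs carry a cluster; it does not verify that the cluster is ligated to the polypeptide chain, and it does not handle the separate case of single iron ions bound by four cysteines.

```python
# Chemical component IDs listed in the text as Fe-S clusters.
FES_COMPONENTS = {"FES", "F4S", "SF4", "F3S", "SF3", "FS5", "CLF", "CLP", "1CL"}


def fes_chains(pdb_text):
    """Return the chain IDs that carry an Fe-S cluster HETATM record.

    Assumes legacy fixed-column PDB format: the residue name is in
    columns 18-20 and the chain ID in column 22 (1-based).
    """
    chains = set()
    for line in pdb_text.splitlines():
        if line.startswith("HETATM") and len(line) > 21:
            res_name = line[17:20].strip()
            if res_name in FES_COMPONENTS:
                chains.add(line[21])
    return chains


# Hypothetical records: one [4Fe-4S] cluster atom on chain A, one zinc ion on chain B.
sample = "\n".join([
    "HETATM 1001 FE1  SF4 A 201      10.000  20.000  30.000  1.00 20.00          FE",
    "HETATM 1002 ZN    ZN B 301      11.000  21.000  31.000  1.00 20.00          ZN",
])
print(fes_chains(sample))
```

Files in mmCIF format would need a different parser, since they are not laid out in fixed columns.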

2) Dataset of Negatives extracted from PDB: The negative set comprises sequences that do not bind Fe-S clusters. It was obtained in the following steps:
- Removing from the PDB dataset all sequences with a sequence identity greater than 30% to iron-sulfur proteins
- Extracting from the remaining set the structures that either bind a transition metal ion, or are not included in MetalPDB (i.e. have a sequence identity lower than 30% to any structure contained in MetalPDB) but have at least four cysteines in their sequence. The four-cysteine criterion was used because Fe-S binding sites are most often formed by four cysteines.
The table below lists all the datasets that were merged into the negative dataset.
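The selection rules of step 2 can be sketched as a single predicate. This is only an illustration under stated assumptions: the function and parameter names are hypothetical, the sequence identities (in percent) are assumed to have been computed beforehand with an alignment tool, and the four-cysteine requirement is assumed to apply only to structures not covered by MetalPDB.

```python
def count_cysteines(sequence):
    """Count cysteine residues (one-letter code C) in a protein sequence."""
    return sequence.upper().count("C")


def is_negative_candidate(sequence, identity_to_fes, binds_transition_metal,
                          identity_to_metalpdb):
    """Apply the negative-set rules described above (hypothetical helper).

    identity_to_fes: highest % identity to any iron-sulfur protein.
    binds_transition_metal: True if the structure binds a transition metal ion.
    identity_to_metalpdb: highest % identity to any MetalPDB structure.
    """
    # Step 1: discard anything too similar to a known Fe-S protein.
    if identity_to_fes > 30.0:
        return False
    # Step 2a: keep structures that bind a transition metal ion.
    if binds_transition_metal:
        return True
    # Step 2b: otherwise keep only structures outside MetalPDB
    # that have at least four cysteines.
    outside_metalpdb = identity_to_metalpdb < 30.0
    return outside_metalpdb and count_cysteines(sequence) >= 4


# Hypothetical inputs: a non-metal-binding sequence with four cysteines.
print(is_negative_candidate("MCACACAC", 10.0, False, 12.0))
```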

3) Escherichia coli proteome: in this proteome, each FASTA sequence is tagged with "pos_" if the protein has been reported to be an Fe-S protein (see Table S2 in the supplementary material), or with "neg_" if there is no published evidence that it is an Fe-S protein.
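A quick way to check the labels after downloading the proteome is to count the tagged headers. The sketch below assumes that the "pos_" or "neg_" tag appears at the very start of each FASTA header, immediately after ">". The exact header layout is not specified here, and the function name and sample identifiers are hypothetical.

```python
def count_labels(fasta_text):
    """Count FASTA records by their pos_/neg_ header prefix."""
    counts = {"pos": 0, "neg": 0, "other": 0}
    for line in fasta_text.splitlines():
        if not line.startswith(">"):
            continue  # sequence line, not a header
        header = line[1:].strip()
        if header.startswith("pos_"):
            counts["pos"] += 1
        elif header.startswith("neg_"):
            counts["neg"] += 1
        else:
            counts["other"] += 1
    return counts


# Hypothetical three-record file.
sample_fasta = """>pos_protein1
MKCACPCC
>neg_protein2
MKLLAA
>neg_protein3
MSTTV
"""
print(count_labels(sample_fasta))
```

Run on the full proteome file, the counts should match the totals given in the download table (149 positives and 4282 negatives) if the prefix assumption holds.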

| Description | Redundant dataset (<100% id) | Non-redundant dataset (<30% id) |
| --- | --- | --- |
| Dataset of NOT-binding structures with four CYS | download (4338 sequences) | download (579 sequences) |
| Dataset of ZINC-binding structures | download (17986 sequences) | download (2422 sequences) |
| Dataset of MANGANESE-binding structures | download (4855 sequences) | download (563 sequences) |
| Dataset of COBALT-binding structures | download (1020 sequences) | download (246 sequences) |
| Dataset of COPPER-binding structures | download (2451 sequences) | download (194 sequences) |
| Dataset of MOLYBDENUM-binding structures | download (113 sequences) | download (20 sequences) |