Supplementary data to Donaldson et al. (2005):
Alignment of SCL, FLI-1 and PRH clusters.
List of genes localised to the SCL-like Ets/Ets/GATA conserved clusters.
Information on transcription factor binding sites.
Background to the TFBScluster analysis.
The raw data for TFBScluster are BLASTZ/CHAINNET genome alignments held at Genome Bioinformatics (UCSC), including human/mouse, human/mouse/rat and human/chicken. A script converts the downloaded alignment format to FASTA format, and genome-wide transcription factor binding sites (TFBSs) are then identified using TFBSsearch (available on our web site).
The currently implemented alignments include:
The result is a set of libraries containing all the putative sites for different transcription factors. For each TFBS (e.g., EBOX) one library is created for the core sequence 'CANNTG'. In this core library the IUPAC letter 'N' is allowed to differ between genomes. Further libraries extend the 'core' binding site by one to three nucleotides 5' and 3', i.e., NCANNTGN, NNCANNTGNN or NNNCANNTGNNN. In these extended libraries the IUPAC letter N must be the same in both genomes. Extending the degree of conservation required between the aligned genomes creates a smaller, more specific set of TFBSs.
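The difference between the core and extended libraries can be illustrated with a minimal Python sketch. It is not the TFBSsearch implementation. It assumes two uppercase, gap-free aligned sequences of equal length, handles only the IUPAC letter N, and takes one reading of the rule for extended libraries: every position of the extended site must be identical in both genomes. The function names are hypothetical.

```python
import re

def iupac_to_regex(motif):
    # Only 'N' is expanded here; other IUPAC ambiguity codes are left out of this sketch.
    return "".join("[ACGT]" if base == "N" else base for base in motif)

def conserved_sites(human, other, core="CANNTG", flank=0):
    """Return 1-based (start, end) of conserved core sites in `human`.

    flank=0: both genomes must match the core pattern; N positions may differ.
    flank>0: assumed reading of the text, the whole extended site
             (core plus `flank` bases on each side) must be identical.
    """
    if len(human) != len(other):
        raise ValueError("aligned sequences must have equal length")
    pattern = iupac_to_regex(core)
    core_re = re.compile(pattern)
    # A lookahead lets overlapping candidate sites be reported as well.
    scan = re.compile("(?=" + pattern + ")")
    hits = []
    for match in scan.finditer(human):
        start = match.start()
        end = start + len(core)
        if not core_re.fullmatch(other[start:end]):
            continue  # the second genome does not carry the core motif here
        if flank:
            lo, hi = start - flank, end + flank
            if lo < 0 or hi > len(human):
                continue  # not enough sequence to check the flanks
            window = human[lo:hi]
            if window != other[lo:hi] or re.search("[^ACGT]", window):
                continue
        hits.append((start + 1, end))  # coordinates of the core only
    return hits
```

With `human = "TTCAGCTGAA"` and `other = "TTCATTTGAA"`, the site is kept in the core library (`flank=0`) because both sequences match CANNTG, but it is dropped for `flank=1` because the N positions differ between the two sequences.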
Selected TFBSs have also been screened using Regulatory Potential (RP) scores (also see the corresponding publication at PubMed). For ease of use the 5 bp window scores were converted to areas covered by RP scores >= 0.0002, a threshold determined by analysis of the haemoglobin beta gene locus. New TFBS library files (TFBS_filtered) were created that include only those TFBSs present in these areas.
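The conversion from window scores to areas, and the subsequent filtering, might look like the following sketch. It assumes non-overlapping 5 bp windows given as (start, score) pairs in 1-based inclusive coordinates, and it treats a TFBS as "present" in an area only when it lies entirely inside it. Both are assumptions made for illustration, and the function names are hypothetical.

```python
RP_THRESHOLD = 0.0002  # threshold quoted in the text
WINDOW = 5             # window size in bp quoted in the text

def rp_areas(window_scores, threshold=RP_THRESHOLD, window=WINDOW):
    """Merge adjacent windows scoring >= threshold into (start, end) areas."""
    areas = []
    for start, score in sorted(window_scores):
        if score < threshold:
            continue
        end = start + window - 1
        if areas and start <= areas[-1][1] + 1:
            # Window touches or overlaps the previous area: extend it.
            areas[-1][1] = max(areas[-1][1], end)
        else:
            areas.append([start, end])
    return [tuple(area) for area in areas]

def filter_sites(sites, areas):
    """Keep only sites (start, end) that fall entirely within some area."""
    return [(s, e) for s, e in sites
            if any(a <= s and e <= b for a, b in areas)]
```

For example, windows starting at 1 and 6 that both pass the threshold are merged into the single area 1-10, and a site at 3-8 is retained while one at 9-12 is discarded.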
Information for each TFBS cluster is stored in the GFF format. The start and end sites are coordinates of the human genome. The start and end positions for each TFBS relate to the 'core' sequence; for example, for NNGATANN, start = 3 and end = 6. Clusters are all reported on the sense strand because individual TFBSs may be on either the sense or the complement strand. TFBSs from selected libraries are formed into clusters of a specified size. The final length of each cluster may be greater than the specified size, as overlapping TFBSs are combined to highlight the TFBS-rich region. The UCSC genome assemblies ('builds') are also used by the Ensembl project; this connection allows annotated genes to be localised to the final TFBS clusters.
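The text does not spell out the clustering procedure, so the following is only one plausible sketch. It assumes a cluster is a window of the specified size that contains a required number of sites from each library, and that overlapping windows are then merged, which is how a final cluster can exceed the specified size. The function names, the `required` mapping and the GFF source and feature labels are hypothetical.

```python
from collections import Counter

def core_coords(flank, core_len, offset=0):
    """1-based start/end of the core within an extended site beginning at `offset`."""
    return offset + flank + 1, offset + flank + core_len

def find_clusters(sites, required, window):
    """sites: (start, end, library) tuples, 1-based inclusive human coordinates.
    required: mapping of library name to minimum count, e.g. {"ETS": 2, "GATA": 1}.
    """
    sites = sorted(sites)
    need = Counter(required)
    raw = []
    for i, (start, _, _) in enumerate(sites):
        limit = start + window - 1
        # Sites starting at or after this one that end inside the window.
        members = [s for s in sites[i:] if s[1] <= limit]
        have = Counter(name for _, _, name in members)
        if members and all(have[n] >= c for n, c in need.items()):
            raw.append((start, max(e for _, e, _ in members)))
    merged = []
    for start, end in raw:
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)  # combine overlapping windows
        else:
            merged.append([start, end])
    return [tuple(m) for m in merged]

def to_gff(chrom, clusters, source="TFBScluster", feature="TFBS_cluster"):
    # Nine tab-separated GFF columns; clusters are reported on the '+' strand.
    return ["\t".join([chrom, source, feature, str(s), str(e), ".", "+", ".", "."])
            for s, e in clusters]
```

Here `core_coords(2, 4)` gives `(3, 6)`, matching the NNGATANN example above. With a window of 30 bp, two overlapping qualifying windows at 100-123 and 110-130 are merged into one cluster at 100-130, which is 31 bp long.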
The version of Ensembl used is 19.34a.
All Ensembl-annotated transcripts are localised to each cluster when a cluster is contained in a transcript, or a transcript is located within 100 kb of a cluster. As a cluster may be localised to many transcripts, the list is processed to identify one of two scenarios for each cluster:
To identify the function of transcripts localised to clusters, the SWISS-PROT identifier and LocusLink identifier in the Ensembl annotation are used (where available) to identify genes with characterised gene products. Anecdotally, more Ensembl genes have LocusLink IDs, but those genes may not have well-defined functions.
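The localisation rule described earlier (a cluster contained in a transcript, or a transcript within 100 kb of a cluster) can be sketched as an interval test. This is an illustration only: it assumes the cluster and all transcripts are on the same chromosome, uses 1-based inclusive coordinates, and measures distance as the gap between the nearest ends. The transcript identifiers and function names are hypothetical.

```python
MAX_DISTANCE = 100_000  # 100 kb, as stated in the text

def gap(a_start, a_end, b_start, b_end):
    """Distance between two intervals; 0 if they overlap."""
    return max(0, b_start - a_end, a_start - b_end)

def localise(cluster, transcripts, max_distance=MAX_DISTANCE):
    """cluster: (start, end); transcripts: (id, start, end) tuples.
    Returns (id, relation, distance) for each transcript meeting the rule.
    """
    c_start, c_end = cluster
    hits = []
    for tid, t_start, t_end in transcripts:
        if t_start <= c_start and c_end <= t_end:
            hits.append((tid, "contains_cluster", 0))
        else:
            distance = gap(c_start, c_end, t_start, t_end)
            if distance <= max_distance:
                hits.append((tid, "nearby", distance))
    return hits
```

For a cluster at 1,000,000-1,000,200, a transcript spanning 990,000-1,010,000 is reported as containing the cluster, a transcript starting at 1,050,000 is reported as nearby at a distance of 49,800 bp, and one starting at 1,200,000 is not reported.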
The version of SWISS-PROT used by this tool is 45.0 (25 OCT 2004).
The version of LocusLink used by this tool is of November 2004.
Last modified: Wednesday 24 November 2004.