Supplementary Data.

Supplementary data to Donaldson et al. (2005):
Genome-wide identification of cis-regulatory sequences controlling blood and endothelial development.

Alignment of SCL, FLI-1 and PRH clusters.

List of genes localised to the SCL-like Ets/Ets/GATA conserved clusters.

Information on transcription factor binding sites.
TFBS name: ETS
Bound by: Winged helix-turn-helix transcription factor family members including Elf-1 (ETS related transcription factor-1), Fli-1 (Friend Leukemia Integration factor-1) and PU.1 (a.k.a. Spi-1).
Function: The ETS family members have important roles in haematopoiesis (Sharrocks and coworkers 1997), binding critically important regulatory elements in vitro and within haematopoietic progenitor cells (Gottgens and coworkers 2002). PU.1 is required for macrophage development and for other myeloid and lymphocytic lineages (Warren and Rothenberg 2003).
Ref: Based on the core consensus sequence detailed by Sharrocks and coworkers (1997) and TRANSFAC(v6) accessions M00032 and M00074.

TFBS name: GATA
Bound by: Zinc finger transcription factors GATA-1, GATA-2 and GATA-3.
Function: GATA factors are key regulators of haematopoiesis (Weiss and Orkin 1995). GATA-1 has been identified as a component of the SCL binding complex, and GATA-2 has been shown to contribute to a necessary and sufficient 3' enhancer of the SCL gene (Gottgens and coworkers 2002). GATA-1 is essential for erythroid development and is thought to act in mutual antagonism with PU.1 (Warren and Rothenberg 2003).
Ref: The GATA motif is the most widely identified binding sequence of GATA-1 (TRANSFAC(v6) accessions M0278, M00348 and M0349) and GATA-2 (TRANSFAC(v6) accessions M00126, M00127, M00128, M00203, M00346 and M00347). It should be noted that Merika and Orkin (1993) identified variation in the last position of the GATA motif.

Background to the TFBScluster analysis.
TFBScluster was designed to identify clusters of transcription factor binding sites (TFBSs) conserved in mammalian genomes. It reports clusters that contain a specified selection of TFBSs. An additional suite of programs can also provide a list of the SWISS-PROT/LocusLink characterised genes to which the clusters are localised. This information may be used directly in the experimental validation of a region. All these programs (Perl scripts) are available on request.

The raw data for TFBScluster are BLASTZ/CHAINNET genome alignments held at Genome Bioinformatics (UCSC), including human/mouse, human/mouse/rat and human/chicken. Genome-wide TFBSs are identified using TFBSsearch (available on our web site) via a script that converts the downloaded data format to the FASTA format.

The currently implemented alignments include:

  • June 2003 human assembly (also known as build 34).
  • Feb. 2003 mouse assembly (also known as MGSCv4 or mm3).
  • June 2003 rat assembly (also known as rn3).
  • Feb. 2004 chicken assembly (also known as galGal2).

The result is a set of libraries containing all the putative sites for different transcription factors. For each TFBS (e.g., EBOX) one library is created for the core sequence, here 'CANNTG'. Within the core, a position matching the IUPAC letter 'N' is allowed to differ between genomes. Further libraries extend the 'core' binding site by one to three nucleotides 5' and 3', i.e., NCANNTGN, NNCANNTGNN or NNNCANNTGNNN. In these extended libraries each flanking position marked 'N' must hold the same nucleotide in both genomes. Extending the degree of conservation required between the aligned genomes in this way yields a smaller, more specific set of TFBSs.
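The library rule described above can be illustrated with a minimal Python sketch. It is not the TFBSsearch implementation: it assumes an ungapped, upper-case human/mouse alignment held in two equal-length strings, scans one strand only, and the function name conserved_sites and its flank parameter are hypothetical.

```python
import re

CORE = "CANNTG"  # E-box core sequence quoted in the text


def conserved_sites(human, mouse, core=CORE, flank=0):
    """Return 0-based core start positions conserved in both aligned sequences.

    Assumption: `human` and `mouse` are ungapped, aligned, upper-case strings.
    Core 'N' positions may differ between genomes; the `flank` extra
    positions on each side must hold the same nucleotide in both genomes.
    """
    core_re = re.compile("".join("[ACGT]" if c == "N" else c for c in core))
    width = len(core) + 2 * flank
    hits = []
    for i in range(min(len(human), len(mouse)) - width + 1):
        h, m = human[i:i + width], mouse[i:i + width]
        h_core = h[flank:flank + len(core)]
        m_core = m[flank:flank + len(core)]
        if not (core_re.fullmatch(h_core) and core_re.fullmatch(m_core)):
            continue
        # Flanking positions: must be real nucleotides and identical.
        h_flank = h[:flank] + h[flank + len(core):]
        m_flank = m[:flank] + m[flank + len(core):]
        if h_flank != m_flank or not re.fullmatch("[ACGT]*", h_flank):
            continue
        hits.append(i + flank)
    return hits


human = "TTCAGCTGAA"
mouse = "TTCATATGAC"
core_only = conserved_sites(human, mouse, flank=0)  # CANNTG library
one_flank = conserved_sites(human, mouse, flank=1)  # NCANNTGN library
two_flank = conserved_sites(human, mouse, flank=2)  # NNCANNTGNN library
```

In this toy alignment the site survives in the core and one-nucleotide libraries but drops out of the two-nucleotide library, because the outermost 3' flanking base differs between the two sequences.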

Selected TFBSs have also been screened using Regulatory Potential (RP) scores (also see the corresponding publication at PubMed). For ease of use the 5 bp window scores were converted to areas covered by RP scores >= 0.0002. This threshold was determined by analysis of the haemoglobin beta gene locus. New TFBS library files (TFBS_filtered) were created to include only those TFBSs present in these areas.
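The conversion from window scores to areas can be sketched as follows. This is an illustration under stated assumptions rather than the original script: windows are taken to be (start, end, score) tuples with half-open coordinates, a TFBS counts as "present" in an area only when it lies entirely inside it, and the names rp_areas and filter_sites are hypothetical.

```python
RP_THRESHOLD = 0.0002  # threshold quoted in the text


def rp_areas(windows, threshold=RP_THRESHOLD):
    """Merge touching or overlapping windows scoring >= threshold into areas."""
    areas = []
    for start, end, score in sorted(windows):
        if score < threshold:
            continue
        if areas and start <= areas[-1][1]:
            # Window touches or overlaps the previous area: extend it.
            areas[-1][1] = max(areas[-1][1], end)
        else:
            areas.append([start, end])
    return [tuple(a) for a in areas]


def filter_sites(sites, areas):
    """Keep only sites (start, end) lying entirely inside some area."""
    return [(s, e) for s, e in sites
            if any(a <= s and e <= b for a, b in areas)]


windows = [(0, 5, 0.001), (5, 10, 0.0002), (10, 15, 0.0001), (15, 20, 0.0005)]
areas = rp_areas(windows)
kept = filter_sites([(2, 8), (9, 12), (16, 19)], areas)
```

Here the two adjacent high-scoring windows fuse into one area, the low-scoring window splits the region, and the site spanning that gap is dropped.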

Information for each TFBS cluster is stored in the GFF format. The start and end sites are coordinates of the human genome. The start and end positions for each TFBS relate to the 'core' sequence; for example, for NNGATANN, start = 3 and end = 6. All clusters are reported on the sense strand, because the individual TFBSs within a cluster may lie on either the sense or the complement strand. TFBSs from selected libraries are formed into clusters of a specified size. The final length of each cluster may be greater than the specified size, as overlapping TFBSs are combined to highlight the TFBS-rich region. The UCSC genome assemblies ('builds') are also used by the Ensembl project; this connection allows annotated genes to be localised to the final TFBS clusters.
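The coordinate convention and the cluster-forming step can be sketched in Python. The actual TFBScluster algorithm is not described here, so this is only one plausible reading: a window of the specified size is opened at each TFBS, windows holding the required TFBS counts become candidates, and overlapping candidates are merged. All function names, and the "TFBScluster" and "cluster" values written to the GFF source and feature columns, are hypothetical.

```python
def core_span(site_start, flank, core_len):
    """1-based inclusive core coordinates within an extended site."""
    start = site_start + flank
    return start, start + core_len - 1


def find_clusters(sites, required, size):
    """Find regions where the required TFBS counts occur within `size` bp.

    sites:    list of (name, start, end), 1-based inclusive core coordinates
    required: dict mapping TFBS name to minimum count, e.g. Ets/Ets/GATA
    Overlapping candidate regions are merged, so a reported cluster may be
    longer than `size`.
    """
    sites = sorted(sites, key=lambda s: s[1])
    candidates = []
    for i, (_, start, _) in enumerate(sites):
        members = [s for s in sites[i:] if s[2] < start + size]
        names = [s[0] for s in members]
        if members and all(names.count(n) >= c for n, c in required.items()):
            candidates.append((start, max(s[2] for s in members)))
    merged = []
    for start, end in candidates:
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(m) for m in merged]


def gff_line(chrom, start, end, note):
    """Format one cluster as a nine-column, tab-separated GFF record."""
    fields = [chrom, "TFBScluster", "cluster", str(start), str(end),
              ".", "+", ".", note]
    return "\t".join(fields)


sites = [("ETS", 100, 105), ("ETS", 120, 125), ("GATA", 140, 143),
         ("ETS", 160, 165), ("GATA", 1000, 1003)]
clusters = find_clusters(sites, {"ETS": 2, "GATA": 1}, size=50)
record = gff_line("chr1", clusters[0][0], clusters[0][1], "ETS;ETS;GATA")
```

With a specified size of 50 bp, two overlapping candidate windows merge into a single 66 bp cluster, which is the lengthening effect described above. The call core_span(1, 2, 4) reproduces the NNGATANN example (start = 3, end = 6).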

The version of Ensembl used is 19.34a.

An Ensembl annotated transcript is localised to a cluster when the cluster is contained in the transcript, or when the transcript is located within 100 kb of the cluster. As a cluster may be localised to many transcripts, the list is processed to identify one of two scenarios for each cluster:

  1. A cluster is situated in the intron of a transcript.
  2. A cluster is situated 5' to a transcript and/or 3' to a transcript. The nearest transcript is selected in both situations.
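The two scenarios can be sketched in Python as follows. This is a simplified illustration, not the original Perl: transcripts are assumed to be (id, start, end, strand) tuples on a single chromosome, exon structure is ignored so any cluster contained in a transcript is treated as scenario 1, 5' and 3' are taken relative to the transcript strand, and the name localise is hypothetical.

```python
MAX_DISTANCE = 100_000  # the 100 kb limit quoted in the text


def localise(cluster, transcripts, max_distance=MAX_DISTANCE):
    """Assign a cluster (start, end) to transcripts.

    Returns {"within": [ids]} when the cluster lies inside one or more
    transcripts; otherwise a dict holding the nearest transcript on the
    "5prime" and/or "3prime" side within `max_distance`.
    """
    c_start, c_end = cluster
    containing = [t[0] for t in transcripts
                  if t[1] <= c_start and c_end <= t[2]]
    if containing:
        return {"within": containing}

    nearest = {}
    for tid, t_start, t_end, strand in transcripts:
        if c_end < t_start:        # cluster lies before the transcript
            distance = t_start - c_end
            side = "5prime" if strand == "+" else "3prime"
        elif c_start > t_end:      # cluster lies after the transcript
            distance = c_start - t_end
            side = "3prime" if strand == "+" else "5prime"
        else:
            continue               # partial overlap: not handled in this sketch
        if distance > max_distance:
            continue
        if side not in nearest or distance < nearest[side][1]:
            nearest[side] = (tid, distance)
    return {side: tid for side, (tid, _) in nearest.items()}


transcripts = [("T1", 1000, 5000, "+"),
               ("T2", 200000, 210000, "+"),
               ("T3", 60000, 70000, "+")]
inside = localise((2000, 2100), transcripts)
flanking = localise((50000, 50100), transcripts)
```

In the second call the cluster sits 3' of T1 and 5' of T3, while T2 is ignored because it lies more than 100 kb away.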

To identify the function of transcripts localised to clusters, the SWISS-PROT and LocusLink identifiers in the Ensembl annotation are used (where available) to identify genes with characterised gene products. Anecdotally, more Ensembl genes have LocusLink IDs than SWISS-PROT IDs, but these genes may not have well-defined functions.

The version of SWISS-PROT used by this tool is 45.0 (25 OCT 2004).

The version of LocusLink used by this tool is of November 2004.

Last modified: Wednesday 24 November 2004.