Here is the problem:
The focus of the Darwin2000 site is the human beta-globin region on chromosome 11 (LOCUS: HUMHBB), a ~73 kb hunk of genomic DNA. Within it lies a 45 kb cluster containing the coding regions for the five beta-like globin genes in the following order: 5'-epsilon -G-gamma -A-gamma -delta -beta-3'. There is an additional gene (actually a pseudogene), beta-1, located between the A-gamma and delta genes. The feature annotations for each gene clearly indicate which segments of genomic DNA are 'joined' to give rise to the coding regions (exons) for each protein.
It is proposed that learning the art of gene-finding is an important first step for an aspiring bioinformaticist. The segment of gDNA annotated as containing the exons and other important sequence features for HBgG lies roughly within positions 34000-36000, so a search of the 3-frame translation of this segment should reveal the coding segments for the different exons. The known peptide sequence will be used as a reference to aid in the 'initial' definition of the coding regions. The final, precise definition of the exons requires analysing the putative exon/intron boundaries: the codon structure for the terminal amino acid of each exon must be preserved, and each internal intron must show the expected 5'- and 3'-end features (GT-AG).
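The 3-frame translation that this search relies on can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the TACG implementation: the names `translate` and `three_frames` are made up here, the standard genetic code is assumed, and only the forward strand is translated. The test input is the first 60 bases of the pasted segment.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate(dna):
    """Translate complete codons from the first base; '*' is a stop, 'X' an unknown codon."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )

def three_frames(dna):
    """Return the forward-strand translations for frames 1, 2 and 3."""
    return {frame: translate(dna[frame - 1:]) for frame in (1, 2, 3)}

# First 60 bases of the pasted segment (positions 1-60 of the TACG output).
first_line = "cctatgcctaaaacacatttcacaatccctgaacttttcaaaaattggtacatgctttaa"
frames = three_frames(first_line)
for frame, peptide in frames.items():
    print(frame, peptide)
```

Frames 2 and 3 come out one residue shorter than the TACG rows because TACG carries the last codon over into the next line of sequence, while this sketch translates the 60 bases in isolation.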
-------------------------------
The promoter region sequences 'ccaat', 'ata' and 'cttctg' are found at bases 34390, 34448 and 34485 in the hbgg gene.
precursor_RNA 34478..36069 /note="G-gamma globin"
Polyadenylation signal hbgg positions 36043-36048
CDS hbgg mRNA join(34531..34622,34745..34967,35854..35982)
exon 1 <34531..34622
intron-1 34623..34744
exon 2 34745..34967
intron-2 34968..35853
exon 3 35854..>35982
---------------------above info lifted from Darwin2000
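The coordinates in the feature table can be sanity-checked with a little arithmetic. The sketch below parses the `join(...)` string copied from the CDS line, derives the intron intervals, and reports where each intron falls relative to the codon structure. The helper names `parse_join` and `introns_between` are hypothetical, chosen only for this illustration.

```python
import re

def parse_join(location):
    """Parse 'join(a..b,c..d,...)' into (start, end) pairs, 1-based inclusive."""
    return [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", location)]

def introns_between(exons):
    """Introns are the gaps between consecutive exons."""
    return [
        (prev_end + 1, next_start - 1)
        for (_, prev_end), (next_start, _) in zip(exons, exons[1:])
    ]

exons = parse_join("join(34531..34622,34745..34967,35854..35982)")
exon_lengths = [end - start + 1 for start, end in exons]
introns = introns_between(exons)
cds_length = sum(exon_lengths)

# Number of bases of an unfinished codon left at the end of each internal exon.
split_bases = []
running = 0
for length in exon_lengths[:-1]:
    running += length
    split_bases.append(running % 3)

print(exon_lengths)                 # lengths of exons 1-3
print(introns)                      # compare with the intron-1 / intron-2 lines
print(cds_length, cds_length % 3)   # a complete CDS is a multiple of 3
print(split_bases)                  # non-zero means the intron interrupts a codon
```

The exons total 444 bases, i.e. 148 codons: the 147 residues of the peptide plus the stop codon. The derived intron intervals agree with the intron-1 and intron-2 lines, and the split values show that intron 1 interrupts a codon after its second base while intron 2 falls between codons.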
Start with positions 34000-36000, i.e. about 2 kb of gDNA (the segment actually pasted in below is 2100 bases).
TACG 90 Char small http://24.1.175.29/tacg3/tacg300.Rest.form.html
For pasted-in sequence, labelled 34000-36000hbgg
==
Sequence info:
2100 bases; 583 A(27.76 %) 430 C(20.48 %) 535 G(25.48 %) 552 T(26.29 %)
1 cctatgcctaaaacacatttcacaatccctgaacttttcaaaaattggtacatgctttaa 60
1 P M P K T H F T I P E L F K N W Y M L *
2 L C L K H I S Q S L N F S K I G T C F N
3 Y A * N T F H N P * T F Q K L V H A L T
61 ctttaaactacaggcctcactggagctacagacaagaaggtgaaaaacggctgacaaaag 120
1 L * T T G L T G A T D K K V K N G * Q K
2 F K L Q A S L E L Q T R R * K T A D K R
3 L N Y R P H W S Y R Q E G E K R L T K E
121 aagtcctggtatcttctatggtgggagaagaaaactagctaaagggaagaataaattaga 180
1 K S W Y L L W W E K K T S * R E E * I R
2 S P G I F Y G G R R K L A K G K N K L E
3 V L V S S M V G E E N * L K G R I N * R
181 gaaaaattggaatgactgaatcggaacaaggcaaaggctataaaaaaaattaagcagcag 240
1 E K L E * L N R N K A K A I K K I K Q Q
2 K N W N D * I G T R Q R L * K K L S S S
3 K I G M T E S E Q G K G Y K K N * A A V
241 tatcctcttgggggccccttccccacactatctcaatgcaaatatctgtctgaaacggtt 300
1 Y P L G G P F P T L S Q C K Y L S E T V
2 I L L G A P S P H Y L N A N I C L K R F
3 S S W G P L P H T I S M Q I S V * N G S
301 cctggctaaactccacccatgggttggccagccttgccttgaccaatagccttgacaagg 360
1 P G * T P P M G W P A L P * P I A L T R
2 L A K L H P W V G Q P C L D Q * P * Q G
3 W L N S T H G L A S L A L T N S L D K A
361 caaacttgaccaatagtcttagagtatccagtgaggccaggggccggcggctggctaggg 420
1 Q T * P I V L E Y P V R P G A G G W L G
2 K L D Q * S * S I Q * G Q G P A A G * G
3 N L T N S L R V S S E A R G R R L A R D
421 atgaagaataaaaggaagcacccttcagcagttccacacactcgcttctggaacgtctga 480
1 M K N K R K H P S A V P H T R F W N V *
2 * R I K G S T L Q Q F H T L A S G T S E
3 E E * K E A P F S S S T H S L L E R L R
481 ggttatcaataagctcctagtccagacgccatgggtcatttcacagaggaggacaaggct 540
1 G Y Q * A P S P D A M G H F T E E D K A
2 V I N K L L V Q T P W V I S Q R R T R L
3 L S I S S * S R R H G S F H R G G Q G Y
541 actatcacaagcctgtggggcaaggtgaatgtggaagatgctggaggagaaaccctggga 600
1 T I T S L W G K V N V E D A G G E T L G
2 L S Q A C G A R * M W K M L E E K P W E
3 Y H K P V G Q G E C G R C W R R N P G K
601 aggtaggctctggtgaccaggacaagggagggaaggaaggaccctgtgcctggcaaaagt 660
1 R * A L V T R T R E G R K D P V P G K S
2 G R L W * P G Q G R E G R T L C L A K V
3 V G S G D Q D K G G K E G P C A W Q K S
661 ccaggtcgcttctcaggatttgtggcaccttctgactgtcaaactgttcttgtcaatctc 720
1 P G R F S G F V A P S D C Q T V L V N L
2 Q V A S Q D L W H L L T V K L F L S I S
3 R S L L R I C G T F * L S N C S C Q S H
721 acaggctcctggttgtctacccatggacccagaggttctttgacagctttggcaacctgt 780
1 T G S W L S T H G P R G S L T A L A T C
2 Q A P G C L P M D P E V L * Q L W Q P V
3 R L L V V Y P W T Q R F F D S F G N L S
781 cctctgcctctgccatcatgggcaaccccaaagtcaaggcacatggcaagaaggtgctga 840
1 P L P L P S W A T P K S R H M A R R C *
2 L C L C H H G Q P Q S Q G T W Q E G A D
3 S A S A I M G N P K V K A H G K K V L T
841 cttccttgggagatgccataaagcacctggatgatctcaagggcacctttgcccagctga 900
1 L P W E M P * S T W M I S R A P L P S *
2 F L G R C H K A P G * S Q G H L C P A E
3 S L G D A I K H L D D L K G T F A Q L S
901 gtgaactgcactgtgacaagctgcatgtggatcctgagaacttcaaggtgagtccaggag 960
1 V N C T V T S C M W I L R T S R * V Q E
2 * T A L * Q A A C G S * E L Q G E S R R
3 E L H C D K L H V D P E N F K V S P G D
961 atgtttcagcactgttgcctttagtctcgaggcaacttagacaactgagtattgatctga 1020
1 M F Q H C C L * S R G N L D N * V L I *
2 C F S T V A F S L E A T * T T E Y * S E
3 V S A L L P L V S R Q L R Q L S I D L S
1021 gcacagcagggtgtgagctgtttgaagatactggggttgggagtgaagaaactgcagagg 1080
1 A Q Q G V S C L K I L G L G V K K L Q R
2 H S R V * A V * R Y W G W E * R N C R G
3 T A G C E L F E D T G V G S E E T A E D
1081 actaactgggctgagacccagtggcaatgttttagggcctaaggagtgcctctgaaaatc 1140
1 T N W A E T Q W Q C F R A * G V P L K I
2 L T G L R P S G N V L G P K E C L * K S
3 * L G * D P V A M F * G L R S A S E N L
1141 tagatggacaactttgactttgagaaaagagaggtggaaatgaggaaaatgacttttctt 1200
1 * M D N F D F E K R E V E M R K M T F L
2 R W T T L T L R K E R W K * G K * L F F
3 D G Q L * L * E K R G G N E E N D F S L
1201 tattagatttcggtagaaagaactttcacctttcccctatttttgttattcgttttaaaa 1260
1 Y * I S V E R T F T F P L F L L F V L K
2 I R F R * K E L S P F P Y F C Y S F * N
3 L D F G R K N F H L S P I F V I R F K T
1261 catctatctggaggcaggacaagtatggtcgttaaaaagatgcaggcagaaggcatatat 1320
1 H L S G G R T S M V V K K M Q A E G I Y
2 I Y L E A G Q V W S L K R C R Q K A Y I
3 S I W R Q D K Y G R * K D A G R R H I L
1321 tggctcagtcaaagtggggaactttggtggccaaacatacattgctaaggctattcctat 1380
1 W L S Q S G E L W W P N I H C * G Y S Y
2 G S V K V G N F G G Q T Y I A K A I P I
3 A Q S K W G T L V A K H T L L R L F L Y
1381 atcagctggacacatataaaatgctgctaatgcttcattacaaacttatatcctttaatt 1440
1 I S W T H I K C C * C F I T N L Y P L I
2 S A G H I * N A A N A S L Q T Y I L * F
3 Q L D T Y K M L L M L H Y K L I S F N S
1441 ccagatgggggcaaagtatgtccaggggtgaggaacaattgaaacatttgggctggagta 1500
1 P D G G K V C P G V R N N * N I W A G V
2 Q M G A K Y V Q G * G T I E T F G L E *
3 R W G Q S M S R G E E Q L K H L G W S R
1501 gattttgaaagtcagctctgtgtgtgtgtgtgtgtgtgtgcgcgcgtgtgtttgtgtgtg 1560
1 D F E S Q L C V C V C V C A R V C L C V
2 I L K V S S V C V C V C V R A C V C V C
3 F * K S A L C V C V C V C A R V F V C V
1561 tgtgagagcgtgtgtttcttttaacgttttcagcctacagcatacagggttcatggtggc 1620
1 C E S V C F F * R F Q P T A Y R V H G G
2 V R A C V S F N V F S L Q H T G F M V A
3 * E R V F L L T F S A Y S I Q G S W W Q
1621 aagaagataacaagatttaaattatggccagtgactagtgctgcaagaagaacaactacc 1680
1 K K I T R F K L W P V T S A A R R T T T
2 R R * Q D L N Y G Q * L V L Q E E Q L P
3 E D N K I * I M A S D * C C K K N N Y L
1681 tgcatttaatgggaaagcaaaatctcaggctttgagggaagttaacataggcttgattct 1740
1 C I * W E S K I S G F E G S * H R L D S
2 A F N G K A K S Q A L R E V N I G L I L
3 H L M G K Q N L R L * G K L T * A * F W
1741 gggtggaagcttggtgtgtagttatctggaggccaggctggagctctcagctcactatgg 1800
1 G W K L G V * L S G G Q A G A L S S L W
2 G G S L V C S Y L E A R L E L S A H Y G
3 V E A W C V V I W R P G W S S Q L T M G
1801 gttcatctttattgtctcctttcatctcaacagctcctgggaaatgtgctggtgaccgtt 1860
1 V H L Y C L L S S Q Q L L G N V L V T V
2 F I F I V S F H L N S S W E M C W * P F
3 S S L L S P F I S T A P G K C A G D R F
1861 ttggcaatccatttcggcaaagaattcacccctgaggtgcaggcttcctggcagaagatg 1920
1 L A I H F G K E F T P E V Q A S W Q K M
2 W Q S I S A K N S P L R C R L P G R R W
3 G N P F R Q R I H P * G A G F L A E D G
1921 gtgactggagtggccagtgccctgtcctccagataccactgagctcactgcccatgatgc 1980
1 V T G V A S A L S S R Y H * A H C P * C
2 * L E W P V P C P P D T T E L T A H D A
3 D W S G Q C P V L Q I P L S S L P M M Q
1981 agagctttcaaggataggctttattctgcaagcaatacaaataataaatctattctgcta 2040
1 R A F K D R L Y S A S N T N N K S I L L
2 E L S R I G F I L Q A I Q I I N L F C *
3 S F Q G * A L F C K Q Y K * * I Y S A K
2041 agagatcacacatggttgtcttcagttcttttttttatgtctttttaaatatatgagcca 2100
1 R D H T W L S S V L F F M S F * I Y E P
2 E I T H G C L Q F F F L C L F K Y M S H
3 R S H M V V F S S F F Y V F L N I * A T
-----------------------------------
Search above for:
>HBgG_pep
MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAHGKKVLTSLGDAIKHLDDL
KGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMVTGVASALSSRYH
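Because the peptide is spread over several exons, it will not match any single frame from end to end. One way to get the 'initial' definition is to look for short windows of the peptide in each translated frame. The following is a rough sketch of that idea, not a standard tool: `find_peptide_seeds`, the window size `k=6` and the `first_pos` parameter are assumptions made for this example. The test data are the two sequence lines 481-600 and the first 40 residues of HBgG_pep.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate(dna):
    """Translate complete codons from the first base ('*' = stop, 'X' = unknown)."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )

def find_peptide_seeds(dna, peptide, first_pos=1, k=6):
    """Find k-residue windows of the peptide in each forward frame.

    Returns (frame, position of the window's first base, peptide offset) tuples,
    where positions are numbered so that dna[0] is first_pos.
    """
    hits = []
    for frame in (1, 2, 3):
        protein = translate(dna[frame - 1:])
        for offset in range(len(peptide) - k + 1):
            seed = peptide[offset:offset + k]
            idx = protein.find(seed)
            while idx != -1:
                hits.append((frame, first_pos + (frame - 1) + 3 * idx, offset))
                idx = protein.find(seed, idx + 1)
    return hits

# Sequence lines 481-540 and 541-600 of the TACG output.
region = (
    "ggttatcaataagctcctagtccagacgccatgggtcatttcacagaggaggacaaggct"
    "actatcacaagcctgtggggcaaggtgaatgtggaagatgctggaggagaaaccctggga"
)
peptide_start = "MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQ"

hits = find_peptide_seeds(region, peptide_start, first_pos=481)
frame1_hits = [h for h in hits if h[0] == 1]
print(frame1_hits[0], frame1_hits[-1])
```

On this snippet the first frame-1 hit starts at position 511, the ATG, and the windows stop matching once they run past the end of the snippet, near the exon 1 boundary. That gives a rough exon location which still has to be refined at the nucleotide level.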
------------------------------------------------------------
N.B. The above constitutes the log file; below is the summary file.
361 caaacttgaccaatagtcttagagtatccagtgaggccaggggccggcggctggctaggg 420
possible promoter region
421 atgaagaataaaaggaagcacccttcagcagttccacacactcgcttctggaacgtctga 480
possible promoter region
481 ggttatcaataagctcctagtccagacgccatgggtcatttcacagaggaggacaaggct 540
1 G Y Q * A P S P D A M G H F T E E D K A exon-1
541 actatcacaagcctgtggggcaaggtgaatgtggaagatgctggaggagaaaccctggga 600
1 T I T S L W G K V N V E D A G G E T L G
exon/int (NB -1)
601 aggtaggctctggtgaccaggacaagggagggaaggaaggaccctgtgcctggcaaaagt 660
1 R * A L V T R T R E G R K D P V P G K S
int/exon(NB+1)
721 acaggctcctggttgtctacccatggacccagaggttctttgacagctttggcaacctgt 780
3 R L L V V Y P W T Q R F F D S F G N L S
781 cctctgcctctgccatcatgggcaaccccaaagtcaaggcacatggcaagaaggtgctga 840
3 S A S A I M G N P K V K A H G K K V L T
841 cttccttgggagatgccataaagcacctggatgatctcaagggcacctttgcccagctga 900
3 S L G D A I K H L D D L K G T F A Q L S exon-2
exon/int
901 gtgaactgcactgtgacaagctgcatgtggatcctgagaacttcaaggtgagtccaggag 960
3 E L H C D K L H V D P E N F K V S P G D
int/exon
1801 gttcatctttattgtctcctttcatctcaacagctcctgggaaatgtgctggtgaccgtt 1860
1 V H L Y C L L S S Q Q L L G N V L V T V exon-3
1861 ttggcaatccatttcggcaaagaattcacccctgaggtgcaggcttcctggcagaagatg 1920
1 L A I H F G K E F T P E V Q A S W Q K M
1921 gtgactggagtggccagtgccctgtcctccagataccactgagctcactgcccatgatgc 1980
1 V T G V A S A L S S R Y H * A H C P * C
1981 agagctttcaaggataggctttattctgcaagcaatacaaataataaatctattctgcta 2040
possible polyadenylation site
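The polyadenylation signal can be located with a plain substring search and then mapped back to locus coordinates. This is a small sketch with two assumptions that are not stated in the log: the helper name `find_motif` is made up, and the pasted segment is taken to begin at locus position 34021 (an offset of 34020). That offset is the one that puts the annotated CDS start 34531 on the ATG at local position 511.

```python
def find_motif(seq, motif, first_pos=1):
    """Return the start position of every (possibly overlapping) occurrence of motif.

    Positions are numbered so that seq[0] is first_pos.
    """
    seq, motif = seq.lower(), motif.lower()
    positions = []
    idx = seq.find(motif)
    while idx != -1:
        positions.append(first_pos + idx)
        idx = seq.find(motif, idx + 1)
    return positions

# Sequence line 1981-2040 of the translated segment.
line_1981 = "agagctttcaaggataggctttattctgcaagcaatacaaataataaatctattctgcta"

# ASSUMPTION: local position 1 of the pasted segment is locus position 34021.
LOCUS_OFFSET = 34020

local_hits = find_motif(line_1981, "aataaa", first_pos=1981)
locus_hits = [(p + LOCUS_OFFSET, p + LOCUS_OFFSET + 5) for p in local_hits]
print(local_hits, locus_hits)
```

Under that offset the single AATAAA in this line maps onto the 36043-36048 interval given for the polyadenylation signal in the Darwin2000 feature list.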
CONCLUSIONS:
What lessons in strategy can be learned from this experience?
Exonic sequence is defined by the nucleotide sequence, not the peptide: note the
phase and frame shift at the ex1/int1 and int1/ex2 junctions (exon 1 reads in frame 1, exon 2 in frame 3). The intron ends are consistently GT/AG.
Note the 'R' at both the end of ex1 and the start of ex2; look at the bases to choose between them!
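The junction check described here can be reproduced on the sequence lines 481-780, which cover exon 1, intron 1 and the start of exon 2. The sketch below is illustrative only: `fetch`, `intron_ends` and `splice` are hypothetical helper names, and the local exon coordinates (511-602, and 725 onward) assume that the pasted segment starts at locus position 34021, so that 34531 maps to 511.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate(dna):
    """Translate complete codons from the first base ('*' = stop, 'X' = unknown)."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )

def fetch(seq, first_pos, start, end):
    """Slice with 1-based inclusive positions, where seq[0] is position first_pos."""
    return seq[start - first_pos:end - first_pos + 1]

def intron_ends(seq, first_pos, exons):
    """Return (first two bases, last two bases) of each intron between the exons."""
    ends = []
    for (_, prev_end), (next_start, _) in zip(exons, exons[1:]):
        intron = fetch(seq, first_pos, prev_end + 1, next_start - 1)
        ends.append((intron[:2], intron[-2:]))
    return ends

def splice(seq, first_pos, exons):
    """Join the exon segments in order."""
    return "".join(fetch(seq, first_pos, s, e) for s, e in exons)

# Sequence lines 481-780 of the TACG output.
segment = (
    "ggttatcaataagctcctagtccagacgccatgggtcatttcacagaggaggacaaggct"
    "actatcacaagcctgtggggcaaggtgaatgtggaagatgctggaggagaaaccctggga"
    "aggtaggctctggtgaccaggacaagggagggaaggaaggaccctgtgcctggcaaaagt"
    "ccaggtcgcttctcaggatttgtggcaccttctgactgtcaaactgttcttgtcaatctc"
    "acaggctcctggttgtctacccatggacccagaggttctttgacagctttggcaacctgt"
)
FIRST_POS = 481
# ASSUMPTION: local coordinates; exon 2 is truncated at the end of this snippet.
exons = [(511, 602), (725, 780)]

ends = intron_ends(segment, FIRST_POS, exons)
spliced_peptide = translate(splice(segment, FIRST_POS, exons))
print(ends)
print(spliced_peptide)
```

Exon 1 is 92 bases long, so it ends two bases into a codon: the 'ag' at 601-602 is completed by the 'g' at 725 to give the single 'R' of the peptide. That is why an 'R' shows up in both frame 1 and frame 3 of the raw translation, and why the choice has to be made at the nucleotide level.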
What steps might you employ (& in what order) to efficiently attempt to define a
gene locus in genomic DNA?
Translate, search the ORFs for a peptide match, then check the consistency of the exon/intron junctions.
It is very difficult to identify possible promoter features.
What would you want to know & if I didn't tell you what would you do about it?
It helps to start with a small hunk of gDNA that contains the coding region for a
known peptide. The TACG output format for linear translation, which shows a direct
relationship between nucleotide and peptide sequence, is very useful for manually
identifying the non-0-phase int/exon junctions, the true exon sequence and the consistent
intron ends, and for resolving multiple choices where exon-terminal amino acids repeat!!
The translation output format from the other (BCM) site gives each frame in a separate
text file and shows no relationship to the nucleotide sequence, which makes it very difficult to
accurately identify the exon/intron boundaries. The numbering at TACG also makes it
easier to reference the scaffold position of the translated gDNA. If I had to find a gene
in a whole genome (e.g. Drosophila) and had only a human or rat peptide (and nucleotide)
sequence, I would try to identify start points by BLAST (x or p) of the gDNA.
The report form (as opposed to the log files) need only contain the essentials and can be quite
short and to the point. Bottom line: I have shown that the hbgg gene is completely
described by three exons, and I feel confident that I have identified the exon/intron
boundaries correctly and that they consistently exhibit GT-AG ends. When the gDNA hunk
is small and the protein is known and also small, this is an extremely easy activity.
Using the WP & IE as tools helped me quickly capture the concepts. Now it would be interesting
to use gene-finding software and see if it is as good as I am and, if not, what the
problems are.