This is the first installment in a series dealing with events leading up to the stable elongation complex of RNA polymerase, DNA and the growing nascent transcript. After things get going, I will begin assembling and cross-referencing the posts, providing an order for those who want to step in at their level of understanding. The levels will range from beginner to specialized researcher. I will include very basic descriptions primarily to establish a baseline of my assumptions. Introductory essays will usually appear after I have gotten a section going because it is far easier and better to write an introduction after you know exactly what you are introducing.
The initial step in expression of any given gene is RNA polymerase binding to a specific sequence upstream of the protein coding sequence, called a promoter. Transcription typically begins at a single base or a closely bunched set of bases. Promoter sequences are typically defined by the position of the initiation site(s). Given that a typical bacterial genome is about 4 million bp in length, there are 8 million potential start sites: 4 million in one direction and 4 million in the opposite direction. Yet, RNAP transcribes the genome non-randomly, starting at a limited set of promoters and proceeding from there in a specific direction. These observations lead to the fundamental question: how does RNAP choose position and orientation on the DNA molecule?
Finding a promoter
What distinguishes a promoter site from non-promoter regions? Since DNA is composed of two complementary anti-parallel strands, the sequence of base pairs is the obvious candidate for determining location. To gain a better understanding of what determines a promoter, let’s perform some gedanken (thought) experiments with a well equipped virtual molecular genetics laboratory. Suppose that we have genetically mapped a region upstream of the rrnB operon, which contains the genes rrsB, rrlB, and rrfB encoding the 16S, 23S, and 5S ribosomal RNAs. In fast growing E. coli cells, this operon is transcribed at a high rate, making it as good a place as any to begin our promoter hunt.
Through the magic of PCR, we obtain a 379 bp fragment from immediately upstream of the rrsB gene. The sequence of this region is
5'-TGATTTGGTTGAATGTTGCGCGGTCAGAAAATTATTTTAAATTTCCTCTT GTCAGGCCGGAATAACTCCCTATAATGCGCCACCACTGACACGGAACAAC GGCAAACACGCCGCCGGGTCAGCGGGGTTCTCCTGAGAACTCCGGCAGAG AAAGCAAAAATAAATGCTTGACTCTGTAGCGGGAAGGCGTATTATGCACA CCCCGCGCCGCTGAGAAAAAGCGAAGCGGCACTGCTCTTTAACAATTTAT CAGACAATCTGTGTGGGCACTCGAAGATACGGATTCTTAACGTCGCAAGA CGAAAAATGAATACCAAGTCTCAAGAGTGAACACGTAATTCATTACGAAG TTTAATTCTTTGAGCGTCAAACTTTTAAA-3'
I show the sequence of only one strand since the sequence of the complementary strand can be deduced by the base pair rules. It is conventional to write all nucleic acid sequences in the 5′ → 3′ direction, leaving out the designation of 5′ & 3′ ends. Unfortunately, there is no universal convention for naming the strands. I will refer to the strand that runs in the same 5′ → 3′ direction as the direction of transcription as the coding strand. This strand has the same sequence as the RNA transcript, except that T is found in the DNA where U is found in the RNA. Thus, this sequence can be immediately used to determine potential protein sequences by using the genetic code. I will refer to the complementary strand as the template strand since it provides the DNA template for RNA synthesis.
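As a small illustration of these strand conventions, the Python sketch below derives the template strand and the corresponding RNA from a stretch of coding strand. The function names are my own for this example, and the input is simply the first 20 bases of the fragment, treated as if it were transcribed in the (+) direction.

```python
# Hypothetical helper names; only the base-pairing rules come from the text.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def template_strand(coding: str) -> str:
    """Return the template strand written 5'->3' (the reverse complement)."""
    return coding.translate(COMPLEMENT)[::-1]

def transcript(coding: str) -> str:
    """Return the RNA read from this stretch: the coding strand with U for T."""
    return coding.replace("T", "U")

coding = "TGATTTGGTTGAATGTTGCG"  # first 20 bases of the fragment above
print("coding  :", coding)
print("template:", template_strand(coding))
print("RNA     :", transcript(coding))
```

Applying template_strand twice returns the original sequence, which is a quick sanity check on the pairing table.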
Just looking at this string of bases provides no immediate clue as to where the promoter may be. Before the development of recombinant DNA technology, geneticists defined the promoter through fine structure mapping and characterization of induced and spontaneous mutations. These mutations were subsequently used to identify positions that are necessary for proper promoter function once DNA sequencing techniques were developed. For illustrative purposes, I will take a more direct biochemical approach, utilizing my well stocked virtual laboratory.
One simple biochemical experiment is to use this DNA as a template in an in vitro reaction to direct transcription by purified RNA polymerase. The premise here is that RNAP will initiate transcription at the same sites as in vivo and will transcribe until it has reached the end of the template. RNAP then falls off releasing an RNA product with a precise 3′ end that matches the end of the DNA fragment. These experiments are called runoff experiments and are useful for other investigations beyond promoter mapping.
The experimental procedure is straightforward: we mix purified RNAP, DNA, and NTPs (ATP, CTP, GTP, UTP) along with appropriate buffer and salt conditions and incubate for a few minutes. We can then analyze the RNA products in several ways. As a preliminary step, we determine the length(s) of the RNA products and find two sizes of RNA: 176 nt & 295 nt in length. While we are happy that we obtained RNA products, the finding of two transcripts complicates matters. We have several possibilities:
- Two promoters directing transcription in the (+) direction towards the rrsB gene.
- Two promoters directing transcription in the opposite (-) direction.
- One promoter for the (+) and another promoter for the (-) direction.
- A slew of other possibilities that need only concern us if we were doing a real experiment.
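To make these possibilities concrete, here is a minimal Python sketch of where each runoff product would have to start under the (+) and (-) hypotheses. It assumes that the 3′ end of each RNA coincides exactly with the last base of the fragment, and it reports positions as 1-based coordinates on the strand shown above. The function names are invented for this example.

```python
FRAGMENT_LENGTH = 379        # bp, from the text
RUNOFF_LENGTHS = (295, 176)  # nt, from the text

def start_if_plus(length_nt: int, fragment_bp: int = FRAGMENT_LENGTH) -> int:
    """Start coordinate if RNAP ran off the right-hand end of the fragment."""
    return fragment_bp - length_nt + 1

def start_if_minus(length_nt: int) -> int:
    """Start coordinate if RNAP ran off the left-hand end of the fragment."""
    return length_nt

for n in RUNOFF_LENGTHS:
    print(f"{n} nt: (+) start at {start_if_plus(n)}, (-) start at {start_if_minus(n)}")
```

Product length alone leaves two candidate start sites per transcript, which is why a second experiment is needed to settle the direction.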
To distinguish between these possibilities, we will use the fact that the RNA produced is single stranded and complementary to one of the DNA strands. There are a number of procedures available, but here we will employ a method using reverse transcriptase (RT). This enzyme uses RNA as a template to synthesize DNA. In addition to the RNA template, RT requires a short DNA strand, called a primer, that is bound to the RNA template and has a free 3′-OH to which new bases are added in a template specific manner (almost all DNA polymerases have this requirement). We exploit this requirement by providing our own synthesized DNA primers: a (+)-primer that is complementary to RNA transcribed in the (+) direction (5′-TTTAAAAGTTTGACGCTCAA) and a (-)-primer that is complementary to RNA transcribed in the (-) direction (5′-TGATTTGGTTGAATGTTGCG). Although we can choose any sequence for our primers, I have chosen primers that correspond to the ends of our DNA fragment and would therefore be complementary to the 3′ ends of the runoff products.
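The primer design can be checked with a few lines of Python. The sketch below takes the first and last 20 bases of the fragment, works out the 3′ end of the runoff RNA expected for each direction (written in the DNA alphabet for simplicity), and confirms that each primer is the reverse complement of the RNA it is meant to anneal to. The variable and function names are my own.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]

FRAGMENT_5_END = "TGATTTGGTTGAATGTTGCG"  # first 20 bases of the fragment
FRAGMENT_3_END = "TTGAGCGTCAAACTTTTAAA"  # last 20 bases of the fragment
PLUS_PRIMER = "TTTAAAAGTTTGACGCTCAA"
MINUS_PRIMER = "TGATTTGGTTGAATGTTGCG"

# 3' ends of the runoff RNAs, written with T in place of U
plus_rna_3_end = FRAGMENT_3_END                       # same as the strand shown
minus_rna_3_end = reverse_complement(FRAGMENT_5_END)  # copied from the other strand

def anneals(primer: str, rna_end: str) -> bool:
    """True if the primer is fully complementary to the given RNA end."""
    return reverse_complement(primer) == rna_end

print("(+)-primer on (+) RNA:", anneals(PLUS_PRIMER, plus_rna_3_end))
print("(-)-primer on (-) RNA:", anneals(MINUS_PRIMER, minus_rna_3_end))
print("(+)-primer on (-) RNA:", anneals(PLUS_PRIMER, minus_rna_3_end))
```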
The procedure is similar to that for the runoff assay: the individual primer is mixed with the RNA products under conditions that allow the primer to pair with the complementary RNA sequence. We then add RT and dNTPs (dATP, dCTP, dGTP, dTTP) to the RNA/primer hybrids and incubate for a period of time. RT will extend the primers and synthesize DNA until it runs out of RNA template. As in the case of our original runoff transcription assay, the length of the DNA products is determined by the position of the 5′ end of the RNA product. We then determine the length(s) of the DNA products: 176 nt & 295 nt DNA products were synthesized only when the primer complementary to RNA transcribed in the (+) direction was used. This procedure is called a primer extension assay. Since we know the lengths of the transcripts produced and the lengths of the primer extensions, we can identify the bases at which transcription initiated. This is often referred to as the +1 site. The sequences below show the region around each start site, with the first base of the respective transcript enclosed in brackets:
295: CCTCTTGTCAGGCCGGAATAACTCCCTATAATGCGCCACC[A]CTGACACGGA
176: ATGCTTGACTCTGTAGCGGGAAGGCGTATTATGCACACCC[C]GCGCCGCTGA
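The arithmetic behind the +1 assignment can be sketched in Python. This example assumes that each primer extension product runs from the last base of the fragment back to the 5′ end of the RNA, so that the product length alone fixes the start position. The function names and the 40-upstream/10-downstream window are choices made for this illustration.

```python
# The 379 bp fragment from the text, joined from its 50-base blocks
FRAGMENT = (
    "TGATTTGGTTGAATGTTGCGCGGTCAGAAAATTATTTTAAATTTCCTCTT"
    "GTCAGGCCGGAATAACTCCCTATAATGCGCCACCACTGACACGGAACAAC"
    "GGCAAACACGCCGCCGGGTCAGCGGGGTTCTCCTGAGAACTCCGGCAGAG"
    "AAAGCAAAAATAAATGCTTGACTCTGTAGCGGGAAGGCGTATTATGCACA"
    "CCCCGCGCCGCTGAGAAAAAGCGAAGCGGCACTGCTCTTTAACAATTTAT"
    "CAGACAATCTGTGTGGGCACTCGAAGATACGGATTCTTAACGTCGCAAGA"
    "CGAAAAATGAATACCAAGTCTCAAGAGTGAACACGTAATTCATTACGAAG"
    "TTTAATTCTTTGAGCGTCAAACTTTTAAA"
)

def plus_one(seq: str, product_length: int) -> tuple[int, str]:
    """Return the 1-based position and identity of the initiating base."""
    index0 = len(seq) - product_length
    return index0 + 1, seq[index0]

def window(seq: str, pos1: int, upstream: int = 40, downstream: int = 10) -> str:
    """Return the sequence from -upstream to +downstream around position pos1."""
    return seq[pos1 - 1 - upstream : pos1 + downstream]

for length in (295, 176):
    pos, base = plus_one(FRAGMENT, length)
    print(f"{length} nt product: +1 is {base} at position {pos}")
    print("   ", window(FRAGMENT, pos))
```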
The longer transcript identifies a promoter called rrnBp1 and the shorter one, rrnBp2. This sort of procedure has been done with these and many other promoters. By examining these sequences, along with sequences of mutant promoters, two important regions for promoter function were identified: the -10 and -35 regions, named for their position relative to the +1 site. Many textbooks report a “consensus” sequence for these two regions, -35: TTGACA & -10: TATAAT. The consensus is the sequence of bases that represents the most common base at each position. Though the idea of a consensus sequence was useful in understanding specific protein-DNA binding, it lacks subtle, important information required for a better description of a DNA binding site. A more sophisticated approach was designed by Tom Schneider and others based on the concepts of information theory developed by Claude Shannon (this will be the subject of other posts).
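To see what a consensus comparison looks like in practice, the Python sketch below scans the two start-site regions for the pair of hexamers that best matches TTGACA and TATAAT. It is only a naive match count: the allowed spacing of 16 to 18 bp between the hexamers is an assumption built into the sketch, indices are 0-based within each region, and the function names are invented here.

```python
MINUS35, MINUS10 = "TTGACA", "TATAAT"  # textbook consensus hexamers

P1_REGION = "CCTCTTGTCAGGCCGGAATAACTCCCTATAATGCGCCACCACTGACACGGA"
P2_REGION = "ATGCTTGACTCTGTAGCGGGAAGGCGTATTATGCACACCCCGCGCCGCTGA"

def matches(site: str, consensus: str) -> int:
    """Count positions at which the site agrees with the consensus."""
    return sum(a == b for a, b in zip(site, consensus))

def best_promoter_match(seq: str, spacers=(16, 17, 18)) -> tuple[int, int, int]:
    """Return (index of -35 hexamer, spacer length, score out of 12)."""
    best = (-1, 0, -1)
    for i in range(len(seq) - 5):
        for spacer in spacers:
            j = i + 6 + spacer          # start of the candidate -10 hexamer
            if j + 6 > len(seq):
                continue
            score = matches(seq[i:i + 6], MINUS35) + matches(seq[j:j + 6], MINUS10)
            if score > best[2]:
                best = (i, spacer, score)
    return best

for name, region in (("p1", P1_REGION), ("p2", P2_REGION)):
    i, spacer, score = best_promoter_match(region)
    j = i + 6 + spacer
    print(name, region[i:i + 6], f"-{spacer} bp-", region[j:j + 6], f"score {score}/12")
```

A count like this treats every position as equally important, which is exactly the limitation that the information-based approach addresses.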
In this methodology, promoter regions from different genes and from different organisms can be aligned using sequence alignment programs. The frequency of occurrence of the bases at each position is used to determine the quantity of information each position contributes, measured in bits. Binary signals carry a maximum of 1 bit of information per position (0 or 1), but DNA can carry 2 bits of information per position (A, C, G, or T). By multiplying the fraction of each base found at a position by the total information content for that position, we can get a measure of the information that each base contributes. Joining all of the positions sequentially yields a matrix of the information contribution of each base across a site. A visually pleasing and informative display of this matrix, called a sequence logo, is a stacked bar chart with the height of the base letters corresponding to their fractional information contribution in bits. To illustrate this visualization, I have aligned the promoter regions upstream of the ribosomal genes in 12 different enteric bacteria. Each strain has ~7 copies of the ribosomal operons, providing 83 different sequences. The sequence logo for the rrnBp1 promoter is shown in the first figure.
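The calculation described above can be sketched in a few lines of Python. This is a simplified version under stated assumptions: the information at a position is taken as 2 bits minus the Shannon entropy of the observed base frequencies, no correction is made for small sample size, and the alignment is a toy example invented for illustration rather than the 83 rrn sequences. The function names are my own.

```python
import math
from collections import Counter

def column_information(column) -> float:
    """Information (bits) at one aligned position: 2 minus the Shannon entropy."""
    n = len(column)
    entropy = -sum((c / n) * math.log2(c / n) for c in Counter(column).values())
    return 2.0 - entropy

def logo_heights(alignment) -> list[dict[str, float]]:
    """For each position, map each observed base to its letter height in bits."""
    heights = []
    for column in zip(*alignment):
        info = column_information(column)
        n = len(column)
        heights.append({base: (c / n) * info for base, c in Counter(column).items()})
    return heights

# Toy alignment (hypothetical sequences, not real promoter data)
toy_alignment = ["TATAAT", "TATAAT", "TATGAT", "TACAAT"]
for position, column_heights in enumerate(logo_heights(toy_alignment), start=1):
    print(position, {b: round(h, 2) for b, h in column_heights.items()})
```

A fully conserved column gives a single letter 2 bits tall, while a column with all four bases at equal frequency contributes nothing to the logo.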
Figure 1. Sequence logo for the rrn-p1 promoter.
The textbook “consensus” for the -10 region, TATAAT, is present in all of the aligned p1 promoters, but only the TTG of the -35 region is completely conserved. The essential concept behind this analysis is that if a specific base pair at a position is required for a function (e.g., binding of RNAP), then mutations at that position will be lost over time due to selection, and the position will tend towards 2 bits of information specified by a single base pair. However, if a specific base pair is not required for any function, mutations will not be selected against, the base frequencies for that site will tend towards 0.25, and the information content for that site will tend towards 0 bits. Between these two extremes are sites which have less than 2 bits but well above 0 bits of information. The thought is that these base pair positions contribute to some function, but there is some flexibility. With this insight, the high information content of the GC base pairs bracketing the -10 region suggests that they are required for some function(s) and that the regions upstream of the -35 region participate in other functions. (The functions of these regions will be discussed at a later date).
Figure 2. Sequence logo for the rrn-p2 promoter
The information content for the rrn-p2 promoters has some similarities to the p1 promoters but there are several differences. Generally, differences in conserved sequences between promoters suggest different functional interactions with RNA polymerase or with different transcription factor proteins. Indeed, the use of these promoters varies depending upon growth conditions: initiation at p1 predominates in fast growing cells and conversely p2 predominates at lower growth rates. Since setting the proper rate of synthesis of ribosomal RNA is crucial for proper cell function, these promoters and the various factors involved in their regulation have been an active area of study since the mid-1960s.
The physical meaning of information content
The rate of progress in our understanding of biological processes has been astounding. Advancements in sequencing technology together with more sophisticated search algorithms powered by faster computers have provided biologists with an amazing tool kit for exploration. However, information provided by bioinformatics requires wet lab experiments to elucidate the functions of evolutionarily conserved sequences, be they nucleic acid or protein sequences. Conversely, classic genetic and biochemical studies provide the bioinformatics community with ideas for new or more refined searches. Complementing these studies is the torrent of structure data from x-ray crystallography and NMR projects.
These advances have taken molecular biology from being a somewhat abstract field of research to a more concrete one. We no longer have to imagine the interaction between one protein and another or with DNA: there is usually a structure in the database of a protein that has sequence similarities to the one that you are studying. This leads to better designed experiments as well as clearer insights as to the mechanisms of different biological processes. It is in this light that I turn to structure in order to place sequence information into perspective.
While our human brains can clearly recognize patterns in the sequence of letters, RNA polymerase has no eyes and must search for a promoter by interacting with the DNA and testing each site to determine whether it is a promoter or not. What then is RNA polymerase’s search algorithm? What does it mean when we say RNAP recognizes a promoter?
Let’s now look at the iconic DNA double helix with these thoughts in mind. We will again use the rrn p1 and p2 promoters as an example. In Fig. 3, I have generated generic DNA double helices for both p1 and p2. With our eyes, it’s hard to say that the two are different, and only with the animated gif switching between the two can we begin to discern a difference.
Figure 3. Animated gif switching between rrnBp1 and rrnBp2 promoters.
One aspect that is obvious is that the sugar-phosphate backbone does not change in this generic representation. In real life, this generic structure is bent, twisted, stretched and sheared due to the underlying sequence and stresses imposed by binding proteins and helical supercoiling. Nevertheless, as a first approximation the differences can be found between base pairs in the accessible regions between the sugar-phosphate backbones. Due to the geometry of base pair structure, the two angles between the backbones differ, yielding a wider major groove and a narrower minor groove. Most specific protein-DNA binding is determined by hydrogen bonding in the major groove. Each base pair presents a unique array of H-bond donors and acceptors to the surface of the major groove. It should be noted that while some H-bond positions are unique, there are others that are not.
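One way to picture this array is to write each base pair as a four-letter code for what it exposes across the major groove. The Python sketch below uses the commonly cited pattern of acceptors (A), donors (D), the thymine methyl group (M) and a nonpolar hydrogen (H), read from the base on the strand shown to its partner. The coding scheme is an assumption I am bringing in for illustration, not something derived from the figures, and the function names are invented.

```python
# A = H-bond acceptor, D = H-bond donor, M = methyl group, H = nonpolar hydrogen
# Keys are the base on the strand shown; patterns read across the major groove.
MAJOR_GROOVE = {
    "A": "ADAM",  # A·T pair
    "T": "MADA",  # T·A pair
    "G": "AADH",  # G·C pair
    "C": "HDAA",  # C·G pair
}

def groove_patterns(seq: str) -> list[str]:
    """Return the major-groove pattern for each base pair in the sequence."""
    return [MAJOR_GROOVE[base] for base in seq.upper()]

def shared_positions(pattern_a: str, pattern_b: str) -> list[int]:
    """Return the positions (0-3) at which two base pairs look the same."""
    return [k for k in range(4) if pattern_a[k] == pattern_b[k]]

for name, hexamer in (("p1 -10", "TATAAT"), ("p2 -10", "TATTAT")):
    print(name, hexamer, " ".join(groove_patterns(hexamer)))

print("A·T and G·C share positions:", shared_positions(MAJOR_GROOVE["A"], MAJOR_GROOVE["G"]))
```

The four patterns are all distinct, yet individual positions can coincide between different base pairs, which is the point made at the end of the paragraph above.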
To better visualize the array of H-bond donors and acceptors, I have untwisted the helix and have laid it flat with the surface of the major groove shown (Fig. 4). Again the animated gif cycles between rrnBp1 and rrnBp2 promoter sequences.
Figure 4. Animated gif of a flattened DNA showing the bases of the major groove switching between rrnBp1 and rrnBp2 promoters.
The closed complex
Now that we have a decent idea of what a promoter looks like, let’s turn to the question of how RNA polymerase recognizes a promoter. RNAP is a multimeric protein with 5 subunits that make up the core enzyme (E): 2 α, 1 β, 1 β’ & 1 ω. The core enzyme catalyzes the synthesis of RNA from DNA, but under normal conditions it is unable to initiate transcription on its own. For initiation, the core polymerase must bind to a σ-factor; the resulting complex is called a holo-enzyme (Eσ). There are several different σ factors in E. coli, and each directs the core enzyme to initiate transcription at a different set of promoters. This implies that one or more domains of the sigma factor make direct contact with the promoter DNA. Additionally, the specificity of binding implies that the holo-enzyme binds more tightly to promoter DNA than to non-specific DNA, and this differential binding forms the basis of the search.
Biochemically, the search can be described as a set of binding reactions. For any DNA of length n, the affinity of RNA polymerase for the DNA sequence at each position i is characterized by an equilibrium binding constant, Ka,i, for the reaction Eσ + DNAi ⇌ Eσ·DNAi.
For a promoter at position p, Ka,p >> Ka,i. Thus, we can envision the search as a series of binding/release steps along the length of the DNA, with the Eσ-promoter complex being the most stable. This complex is called the closed complex since the DNA helix remains intact. In a subsequent step in the initiation process, the DNA helix surrounding the +1 site is pulled apart or melted, forming an open complex.
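Written out explicitly, and assuming simple one-step binding at each position, the relationships above take the following form. The notation beyond Ka,i and Ka,p is my own shorthand, and the last line is the standard consequence of the definitions rather than a measured result.

```latex
\begin{align*}
E\sigma + \mathrm{DNA}_i \;&\rightleftharpoons\; E\sigma\cdot\mathrm{DNA}_i,
  \qquad i = 1, \dots, n \\[4pt]
K_{a,i} &= \frac{[E\sigma\cdot\mathrm{DNA}_i]}{[E\sigma]\,[\mathrm{DNA}_i]} \\[4pt]
\frac{[E\sigma\cdot\mathrm{DNA}_p]}{[E\sigma\cdot\mathrm{DNA}_i]}
  &= \frac{K_{a,p}\,[\mathrm{DNA}_p]}{K_{a,i}\,[\mathrm{DNA}_i]}
  \;\approx\; \frac{K_{a,p}}{K_{a,i}}
\end{align*}
```

The approximation in the last step holds when most sites are unoccupied, so that the free concentrations of the promoter site and of any other single site are about equal. Under that assumption, the relative occupancy of the promoter compared with a non-promoter position is set by the ratio of the two binding constants.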
X-ray crystallography and related techniques applied to protein-DNA complexes have allowed us to visualize the Eσ-DNA interactions. In Fig. 5, I provide a view of the DNA structure surrounding the -35 region obtained by x-ray crystallography (the structure data was obtained from the PDB database). Here the base pairs for the -35 region, TTGACA, are colored yellow, the DNA backbone in dark blue, and the adjacent DNA bases in green.
Figure 5. -35 region DNA.
Adding the structure of region 4 of σ70 (orange) shows that this domain of the protein binds in the major groove of the -35 region (Fig. 6).
Figure 6. σ70 region 4 binding to -35 DNA.
We need to remember that σ region 4 is only a small part of σ70 and that σ70 is one part of the larger holo-enzyme. To gain a better perspective, let us look at the structure of an Eσ bound to promoter DNA. Because of the size of the complex, crystals are difficult to grow and the resolution is not as good as for smaller molecules. Nevertheless, the lab of Seth Darst has published a structure of holo-enzyme bound to a promoter DNA. This is not precisely a closed complex, but it is close enough for our purposes here. Figure 7 shows the structure of the protein and DNA backbones. Here the -35 as well as the -10 regions are colored in yellow and the other sequences are colored in dark blue. The protein backbone of the core enzyme is blue-green (teal?) and σ70 is again orange.
Figure 7. Holo-enzyme – promoter complex.
There are a few things to observe. At the upper left of the figure we can see the binding of σ-region 4 to the -35 element. To the upper right, we can see σ70 making close contact with the -10 element and adjacent bases. If we were to trace out the major groove, we would find that the -35 & -10 elements would lie on the same face of the DNA helix. Also note that in this structure, σ-region 4 does not appear to be tightly associated with the adjacent core enzyme, implying some flexibility. Indeed, early experiments demonstrated that the distance between the -35 and -10 elements is flexible: varying from 16 to 18 bp for different promoters. One important point to notice is the relative difference in size between Eσ and the DNA. This size difference will become important later when we begin to address specific questions related to the mechanism of searching.