Complementarity (molecular biology)

Match up between two DNA bases (guanine and cytosine) showing hydrogen bonds (dashed lines) holding them together
Match up between two DNA bases (adenine and thymine) showing hydrogen bonds (dashed lines) holding them together

In molecular biology, complementarity describes a relationship between two structures, each following the lock-and-key principle. In nature, complementarity is the basic principle of DNA replication and transcription: it is a property shared between two DNA or RNA sequences such that, when they are aligned antiparallel to each other, the nucleotide bases at each position are complementary, much like an object and its mirror image. This complementary base pairing allows cells to copy information from one generation to the next and even to find and repair damage to the information stored in the sequences.

The degree of complementarity between two nucleic acid strands may vary from complete complementarity (every nucleotide is across from its complement) to no complementarity (no nucleotide is across from its complement), and it determines how stably the two strands bind to each other. Furthermore, various DNA repair functions as well as regulatory functions are based on base pair complementarity. In biotechnology, the principle of base pair complementarity allows the generation of hybrids between RNA and DNA and opens the door to modern tools such as cDNA libraries. While most complementarity is seen between two separate strands of DNA or RNA, a sequence can also have internal complementarity, which causes it to bind to itself in a folded configuration.
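The idea of a degree of complementarity can be illustrated with a short Python sketch. It assumes two equal-length DNA strands, both written 5'→3' in upper-case A/C/G/T, and simply reports the fraction of facing positions that form a Watson-Crick pair; the function name and the scoring rule are illustrative choices, not a standard measure of duplex stability.

```python
# Watson-Crick partners for DNA (assumption: plain upper-case A/C/G/T input).
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def degree_of_complementarity(strand_5to3: str, other_5to3: str) -> float:
    """Fraction of positions that pair when the strands are aligned antiparallel."""
    if len(strand_5to3) != len(other_5to3):
        raise ValueError("strands must have equal length")
    if not strand_5to3:
        raise ValueError("strands must not be empty")
    # Reverse the second strand so that position i of each strand faces the other.
    facing = other_5to3[::-1]
    # Unknown characters never count as a pair because PAIRS.get returns None.
    matches = sum(PAIRS.get(a) == b for a, b in zip(strand_5to3, facing))
    return matches / len(strand_5to3)
```

Under these assumptions, `degree_of_complementarity("ATGC", "GCAT")` gives 1.0 (complete complementarity) and `degree_of_complementarity("AAAA", "AAAA")` gives 0.0 (no complementarity).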

DNA and RNA base pair complementarity

Complementarity between two antiparallel strands of DNA. The top strand goes from the left to the right and the lower strand goes from the right to the left lining them up.
Left: the nucleotide base pairs that can form in double-stranded DNA. Between A and T there are two hydrogen bonds, while there are three between C and G. Right: two complementary strands of DNA.

Complementarity is achieved by distinct interactions between nucleobases: adenine, thymine (uracil in RNA), guanine and cytosine. Adenine and guanine are purines, while thymine, cytosine and uracil are pyrimidines. Purines are larger than pyrimidines. The two types of molecule complement each other, and each can only base pair with the opposing type of nucleobase. In nucleic acids, nucleobases are held together by hydrogen bonding, which only works efficiently between adenine and thymine and between guanine and cytosine. The base pair A = T shares two hydrogen bonds, while the base pair G ≡ C has three. All other configurations between nucleobases would hinder double helix formation. The two DNA strands are oriented in opposite directions; they are said to be antiparallel.[1]

Nucleic acid  Nucleobases                                          Base complement
DNA           adenine (A), thymine (T), guanine (G), cytosine (C)  A = T, G ≡ C
RNA           adenine (A), uracil (U), guanine (G), cytosine (C)   A = U, G ≡ C

A complementary strand of DNA or RNA may be constructed based on nucleobase complementarity.[2] Each base pair, A = T vs. G ≡ C, takes up roughly the same space, thereby enabling a twisted DNA double helix formation without any spatial distortions. Hydrogen bonding between the nucleobases also stabilizes the DNA double helix.[3]
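Constructing a complementary strand from these pairing rules can be sketched in a few lines of Python. The function name `reverse_complement` and the `rna` flag are illustrative; the sketch assumes unambiguous bases only and leaves any other character unchanged.

```python
# Translation tables built from the pairing rules A = T (A = U in RNA) and G ≡ C.
DNA_COMPLEMENT = str.maketrans("ATGC", "TACG")
RNA_COMPLEMENT = str.maketrans("AUGC", "UACG")


def reverse_complement(seq: str, rna: bool = False) -> str:
    """Return the complementary strand, written 5'->3' like the input."""
    table = RNA_COMPLEMENT if rna else DNA_COMPLEMENT
    # Complement every base, then reverse because the strands are antiparallel.
    return seq.upper().translate(table)[::-1]
```

For example, `reverse_complement("GTCA")` returns `"TGAC"`, and applying the function twice returns the original sequence.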

Complementarity of DNA strands in a double helix makes it possible to use one strand as a template to construct the other. This principle plays an important role in DNA replication, setting the foundation of heredity by explaining how genetic information can be passed down to the next generation. Complementarity is also utilized in DNA transcription, which generates an RNA strand from a DNA template.[4] In addition, human immunodeficiency virus, a single-stranded RNA virus, encodes an RNA-dependent DNA polymerase (reverse transcriptase) that uses complementarity to catalyze genome replication. The reverse transcriptase can switch between two parental RNA genomes by copy-choice recombination during replication.[5]
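Template-directed synthesis can be sketched as a per-base lookup. The Python example below is a simplified illustration that ignores promoters, primers and enzymes; the function names are hypothetical, and each product is written in the orientation that lines up base by base with its template.

```python
# Pairing rules used when copying between DNA and RNA.
DNA_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}
RNA_TO_DNA = {"A": "T", "U": "A", "G": "C", "C": "G"}


def transcribe(template_3to5: str) -> str:
    """RNA (5'->3') copied from a DNA template strand written 3'->5'."""
    return "".join(DNA_TO_RNA[base] for base in template_3to5)


def reverse_transcribe(rna_5to3: str) -> str:
    """First-strand cDNA, written 3'->5' so it lines up with the RNA."""
    return "".join(RNA_TO_DNA[base] for base in rna_5to3)
```

With the template `TACGGT` (3'→5'), `transcribe` yields the RNA `AUGCCA`, and `reverse_transcribe("AUGCCA")` recovers `TACGGT`.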

DNA repair mechanisms such as proofreading are complementarity based and allow for error correction during DNA replication by removing mismatched nucleobases.[1] In general, damage in one strand of DNA can be repaired by removing the damaged section and replacing it, using complementarity to copy information from the other strand, as occurs in the processes of mismatch repair, nucleotide excision repair and base excision repair.[6]

Nucleic acid strands may also form hybrids: single-stranded DNA readily anneals with complementary DNA or RNA. This principle is the basis of commonly performed laboratory techniques such as the polymerase chain reaction (PCR).[1]

Two strands of complementary sequence are referred to as sense and anti-sense. The sense strand is, generally, the transcribed sequence of DNA or the RNA that was generated in transcription, while the anti-sense strand is the strand that is complementary to the sense sequence.

Self-complementarity and hairpin loops

A sequence of RNA that has internal complementarity which results in it folding into a hairpin

Self-complementarity refers to the fact that a sequence of DNA or RNA may fold back on itself, creating a double-strand-like structure. Depending on how close together the self-complementary parts of the sequence are, the strand may form hairpin loops, junctions, bulges or internal loops.[1] RNA is more likely to form these kinds of structures because of base pairings not seen in DNA, such as guanine binding with uracil.[1]
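A very simple check for hairpin-forming potential is sketched below in Python. It pairs the two ends of an RNA sequence inward, accepts Watson-Crick pairs plus the G-U pair mentioned above, and stops when fewer than `min_loop` unpaired bases would remain. The function name and the minimum loop size of three bases are assumptions made for illustration; real structure prediction uses energy models rather than this counting rule.

```python
# Allowed RNA pairs: Watson-Crick pairs plus the G-U wobble pair.
RNA_PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def hairpin_stem_length(rna: str, min_loop: int = 3) -> int:
    """Count base pairs formed by folding the two ends of `rna` onto each other."""
    i, j = 0, len(rna) - 1
    stem = 0
    # j - i - 1 is the number of bases enclosed by the candidate pair (i, j).
    while j - i - 1 >= min_loop and (rna[i], rna[j]) in RNA_PAIRS:
        stem += 1
        i += 1
        j -= 1
    return stem
```

For `GGGAAAUCC` the sketch reports a three-pair stem (G-C, G-C and a G-U wobble) closing an `AAA` loop, while a sequence with no internal complementarity such as `AAAAAAA` gives zero.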

A sequence of RNA showing hairpins (far right and far upper left), and internal loops (lower left structure)

Regulatory functions

Complementarity can be found between short nucleic acid stretches and a coding region or a transcribed gene, and results in base pairing. These short nucleic acid sequences are commonly found in nature and have regulatory functions such as gene silencing.[1]

Antisense transcripts

Antisense transcripts are stretches of non-coding RNA that are complementary to the coding sequence.[7] Genome-wide studies have shown that RNA antisense transcripts occur commonly in nature. They are generally believed to increase the coding potential of the genetic code and to add a layer of complexity to gene regulation. So far, it is known that 40% of the human genome is transcribed in both directions, underlining the potential significance of antisense transcription.[8] It has been suggested that complementary regions between sense and antisense transcripts allow the generation of double-stranded RNA hybrids, which may play an important role in gene regulation. For example, hypoxia-inducible factor 1α mRNA and β-secretase mRNA are transcribed bidirectionally, and it has been shown that the antisense transcript acts as a stabilizer of the sense transcript.[9]

miRNAs and siRNAs

Formation and function of miRNAs in a cell

MicroRNAs (miRNAs) are short RNA sequences that are complementary to regions of a transcribed gene and have regulatory functions. Current research indicates that circulating miRNAs may be useful as novel biomarkers and therefore show promise for disease diagnostics.[10] miRNAs are cut by a Dicer enzyme from a longer RNA sequence that is transcribed from a regulator gene. These short strands bind to a RISC complex. Owing to their complementarity, they match up with sequences in the upstream region of a transcribed gene and silence the gene in three ways. The first is by preventing a ribosome from binding and initiating translation. The second is by degrading the mRNA that the complex has bound to. The third is by providing a new double-stranded RNA (dsRNA) sequence that Dicer can act upon to create more miRNA, which finds and degrades more copies of the gene's mRNA. Small interfering RNAs (siRNAs) come from other sources of RNA but serve a similar purpose to miRNAs.[1]

Despite their short length, the rules of complementarity mean that miRNAs and siRNAs can still be very discriminating in their choice of targets. With four choices for each base in the strand and a length of 20-22 bases for an miRNA or siRNA, there are more than 1×10^12 possible sequences. Given that the human genome is ~3.1 billion bases in length,[11] an exact match to any particular miRNA is expected to occur by chance less than once in the entire human genome.
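The arithmetic behind this specificity estimate can be checked with a short Python sketch. It assumes, purely for illustration, that genome bases are independent and equally likely and that only one strand is scanned; the function names are hypothetical.

```python
def possible_sequences(length: int) -> int:
    """Number of distinct sequences of the given length over four bases."""
    return 4 ** length


def expected_chance_matches(length: int, genome_size: float) -> float:
    """Expected number of exact chance matches to one fixed sequence.

    Assumes independent, equally likely bases and approximates the number
    of start positions by the genome size.
    """
    return genome_size / possible_sequences(length)
```

A 20-base sequence has 4^20 = 1,099,511,627,776 possible forms, so for a genome of 3.1 billion bases the expected number of chance matches under these assumptions is about 0.003.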

Kissing hairpins

Kissing hairpins are formed when a single strand of nucleic acid complements itself, creating loops of RNA in the form of a hairpin.[12] When two hairpins come into contact with each other in vivo, the complementary bases of the two strands pair up and begin to unwind the hairpins until either a double-stranded RNA (dsRNA) complex is formed or, because of mismatches in the hairpins, the complex unwinds back into two separate strands. The secondary structure of the hairpin prior to kissing allows for a stable structure with a relatively fixed change in energy.[13] These structures balance the stability of the hairpin loop against the binding strength with a complementary strand. If the initial binding to a bad location is too strong, the strands will not unwind quickly enough; if the initial binding is too weak, the strands will never fully form the desired complex. The hairpin structures expose enough bases to provide a strong enough check on the initial binding, while their internal binding is weak enough to allow unfolding once a favorable match has been found.[13]

---C G---
   C G                 ---C G---
   U A                    C G 
   G C                    U A
   C G                    G C
   A G                    C G
  A   A                   A G
   C U                   A   A
    U                     CUU              ---CCUGCAACUUAGGCAGG---
    A                     GAA              ---GGACGUUGAAUCCGUCC---
   G A                   U   U
  U   U                  U   C
   U C                    G C
   G C                    C G
   C G                    A U
   A U                    G C
   G C                 ---G C---
---G C---
Kissing hairpins meeting up at the top of the loops. The complementarity 
of the two heads encourages the hairpin to unfold and straighten out to
become one flat sequence of two strands rather than two hairpins.

Bioinformatics

Complementarity allows information found in DNA or RNA to be stored in a single strand. The complementing strand can be determined from the template and vice versa as in cDNA libraries. This also allows for analysis, like comparing the sequences of two different species. Shorthands have been developed for writing down sequences when there are mismatches (ambiguity codes) or to speed up how to read the opposite sequence in the complement (ambigrams).

cDNA library

A cDNA library is a collection of expressed genes in DNA form that serves as a useful reference tool in gene identification and cloning. cDNA libraries are constructed from mRNA using the RNA-dependent DNA polymerase reverse transcriptase (RT), which copies an mRNA template into DNA. Therefore, a cDNA library can only contain inserts that are meant to be transcribed into mRNA. This process relies on the principle of DNA/RNA complementarity. The end product of the libraries is double-stranded DNA, which may be inserted into plasmids. Hence, cDNA libraries are a powerful tool in modern research.[1][14]

Ambiguity codes

When writing sequences for systematic biology, it may be necessary to use IUPAC codes that mean "either of two" or "any of three" bases. The IUPAC code R (any purine) is complementary to Y (any pyrimidine), and M (amino) to K (keto). W (weak) and S (strong) are usually not swapped[15] but have been swapped in the past by some tools.[16] W and S denote "weak" and "strong", respectively, and indicate the number of hydrogen bonds that a nucleotide uses to pair with its complementing partner. Because the partner uses the same number of bonds to make the pair, the complement of W is W and the complement of S is S.[17]

An IUPAC code that specifically excludes one of the four nucleotides is complementary to the IUPAC code that excludes the complementary nucleotide. For instance, V (A, C or G - "not T") is complementary to B (C, G or T - "not A").

Symbol[18]  Description                    Bases represented  Count
A           adenine                        A                  1
C           cytosine                       C                  1
G           guanine                        G                  1
T           thymine                        T                  1
U           uracil                         U                  1
W           weak                           A T                2
S           strong                         C G                2
M           amino                          A C                2
K           keto                           G T                2
R           purine                         A G                2
Y           pyrimidine                     C T                2
B           not A (B comes after A)        C G T              3
D           not C (D comes after C)        A G T              3
H           not G (H comes after G)        A C T              3
V           not T (V comes after T and U)  A C G              3
N or -      any base (not a gap)           A C G T            4
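The complement of each ambiguity code follows directly from the set of bases it represents. The Python sketch below derives the complement by complementing every base in the set and looking up the code for the resulting set. The dictionary covers the DNA codes from the table only (U and the gap symbol are left out as a simplifying assumption), and the function names are illustrative.

```python
# Bases represented by each IUPAC DNA code (U and "-" omitted for simplicity).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "W": "AT", "S": "CG", "M": "AC", "K": "GT", "R": "AG", "Y": "CT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
BASE_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
# Reverse lookup from a set of bases to its code.
SET_TO_CODE = {frozenset(bases): code for code, bases in IUPAC.items()}


def complement_code(code: str) -> str:
    """Complement an IUPAC code by complementing every base it represents."""
    complemented = frozenset(BASE_COMPLEMENT[base] for base in IUPAC[code])
    return SET_TO_CODE[complemented]


def reverse_complement_iupac(seq: str) -> str:
    """Reverse complement of a DNA sequence that may contain ambiguity codes."""
    return "".join(complement_code(code) for code in reversed(seq))
```

This reproduces the pairings described above: R maps to Y, M to K, V to B and D to H, while W, S and N map to themselves.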

Ambigrams

Specific characters may be used to create a suitable (ambigraphic) nucleic acid notation for complementary bases (i.e. guanine = b, cytosine = q, adenine = n, and thymine = u), which makes it possible to complement entire DNA sequences by simply rotating the text "upside down".[19] For instance, with the previous alphabet, buqn (GTCA) would read as ubnq (TGAC, the reverse complement) if turned upside down.

qqubqnnquunbbqnbb
bbnqbuubnnuqqbuqq
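The rotation trick can be imitated in Python. In the sketch below, turning the text upside down is modeled as reversing the string and swapping each glyph for its rotated form (b↔q, n↔u); the function names are illustrative and only the four-letter alphabet given above is handled.

```python
# A 180-degree rotation turns b into q, q into b, n into u and u into n.
ROTATE_180 = str.maketrans("bqnu", "qbun")
# Ambigraphic alphabet from the text: guanine = b, cytosine = q, adenine = n, thymine = u.
TO_BASE = str.maketrans("bqnu", "GCAT")
TO_GLYPH = str.maketrans("GCAT", "bqnu")


def rotate(text: str) -> str:
    """Read ambigraphic text upside down: rotate each glyph and reverse the order."""
    return text.translate(ROTATE_180)[::-1]


def reverse_complement_by_rotation(dna: str) -> str:
    """Reverse complement obtained purely by rotating the ambigraphic notation."""
    return rotate(dna.translate(TO_GLYPH)).translate(TO_BASE)
```

Here `rotate("buqn")` gives `"ubnq"`, and `reverse_complement_by_rotation("GTCA")` gives `"TGAC"`, matching the example in the text. The two-line example above corresponds to rotating each glyph of the first line without reversing the order, which places every base directly above its complement.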

Ambigraphic notations readily visualize complementary nucleic acid stretches such as palindromic sequences.[20] This feature is enhanced when utilizing custom fonts or symbols rather than ordinary ASCII or even Unicode characters.[20]

References

  1. ^ a b c d e f g h Watson, James; Baker, Tania A.; Bell, Stephen P.; Gann, Alexander; Levine, Michael; Losick, Richard; Harrison, Stephen C. (2014). Molecular Biology of the Gene (Seventh ed.). Boston: Benjamin-Cummings Publishing Company. ISBN 978-0-32176243-6.
  2. ^ Pray, Leslie (2008). "Discovery of DNA structure and function: Watson and Crick". Nature Education. 1 (1): 100. Retrieved 27 November 2013.
  3. ^ Shankar, A; Jagota, A; Mittal, J (Oct 11, 2012). "DNA base dimers are stabilized by hydrogen-bonding interactions including non-Watson-Crick pairing near graphite surfaces". The Journal of Physical Chemistry B. 116 (40): 12088–94. doi:10.1021/jp304260t. PMID 22967176.
  4. ^ Hood, L; Galas, D (Jan 23, 2003). "The digital code of DNA". Nature. 421 (6921): 444–8. Bibcode:2003Natur.421..444H. doi:10.1038/nature01410. PMID 12540920.
  5. ^ Rawson, JMO; Nikolaitchik, OA; Keele, BF; Pathak, VK; Hu, WS (2018). "Recombination is required for efficient HIV-1 replication and the maintenance of viral genome integrity". Nucleic Acids Research. 46 (20): 10535–45. doi:10.1093/nar/gky910. PMID 30307534.
  6. ^ Fleck, O; Nielsen, O (2004). "DNA repair". Journal of Cell Science. 117 (Pt 4): 515–7. doi:10.1242/jcs.00952.
  7. ^ He, Y; Vogelstein, B; Velculescu, VE; Papadopoulos, N; Kinzler, KW (Dec 19, 2008). "The antisense transcriptomes of human cells". Science. 322 (5909): 1855–7. Bibcode:2008Sci...322.1855H. doi:10.1126/science.1163853. PMC 2824178. PMID 19056939.
  8. ^ Katayama, S; Tomaru, Y; Kasukawa, T; Waki, K; Nakanishi, M; Nakamura, M; Nishida, H; Yap, CC; Suzuki, M; Kawai, J; Suzuki, H; Carninci, P; Hayashizaki, Y; Wells, C; Frith, M; Ravasi, T; Pang, KC; Hallinan, J; Mattick, J; Hume, DA; Lipovich, L; Batalov, S; Engström, PG; Mizuno, Y; Faghihi, MA; Sandelin, A; Chalk, AM; Mottagui-Tabar, S; Liang, Z; Lenhard, B; Wahlestedt, C; RIKEN Genome Exploration Research Group; Genome Science Group (Genome Network Project Core Group); FANTOM Consortium (Sep 2, 2005). "Antisense transcription in the mammalian transcriptome". Science. 309 (5740): 1564–6. Bibcode:2005Sci...309.1564R. doi:10.1126/science.1112009. PMID 16141073. S2CID 34559885.
  9. ^ Faghihi, MA; Zhang, M; Huang, J; Modarresi, F; Van der Brug, MP; Nalls, MA; Cookson, MR; St-Laurent G, 3rd; Wahlestedt, C (2010). "Evidence for natural antisense transcript-mediated inhibition of microRNA function". Genome Biology. 11 (5): R56. doi:10.1186/gb-2010-11-5-r56. PMC 2898074. PMID 20507594.
  10. ^ Kosaka, N; Yoshioka, Y; Hagiwara, K; Tominaga, N; Katsuda, T; Ochiya, T (Sep 5, 2013). "Trash or Treasure: extracellular microRNAs and cell-to-cell communication". Frontiers in Genetics. 4: 173. doi:10.3389/fgene.2013.00173. PMC 3763217. PMID 24046777.
  11. ^ "Ensembl genome browser 73: Homo sapiens - Assembly and Genebuild". Ensembl.org. Archived from the original on 15 February 2013. Retrieved 27 November 2013.
  12. ^ Marino, JP; Gregorian RS Jr; Csankovszki, G; Crothers, DM (Jun 9, 1995). "Bent helix formation between RNA hairpins with complementary loops". Science. 268 (5216): 1448–54. Bibcode:1995Sci...268.1448M. doi:10.1126/science.7539549. PMID 7539549.
  13. ^ a b Chang, KY; Tinoco I Jr (May 30, 1997). "The structure of an RNA "kissing" hairpin complex of the HIV TAR hairpin loop and its complement". Journal of Molecular Biology. 269 (1): 52–66. doi:10.1006/jmbi.1997.1021. PMID 9193000.
  14. ^ Wan, KH; Yu, C; George, RA; Carlson, JW; Hoskins, RA; Svirskas, R; Stapleton, M; Celniker, SE (2006). "High-throughput plasmid cDNA library screening". Nature Protocols. 1 (2): 624–32. doi:10.1038/nprot.2006.90. PMID 17406289. S2CID 205463694.
  15. ^ Jeremiah Faith (2011), conversion table
  16. ^ arep.med.harvard.edu A tool page with the note about the applied W-S conversion patch.
  17. ^ Reverse-complement tool page with documented IUPAC code conversion, source code available.
  18. ^ Nomenclature Committee of the International Union of Biochemistry (NC-IUB) (1984). "Nomenclature for Incompletely Specified Bases in Nucleic Acid Sequences". Retrieved 2008-02-04.
  19. ^ Rozak DA (2006). "The practical and pedagogical advantages of an ambigraphic nucleic acid notation". Nucleosides Nucleotides Nucleic Acids. 25 (7): 807–13. doi:10.1080/15257770600726109. PMID 16898419. S2CID 23600737.
  20. ^ a b Rozak, DA; Rozak, AJ (May 2008). "Simplicity, function, and legibility in an enhanced ambigraphic nucleic acid notation". BioTechniques. 44 (6): 811–3. doi:10.2144/000112727. PMID 18476835.