2022-06-18 #r7rs #scheme

A Scheme implementation of the DNA to Protein conversion program given in Python 3 at: https://www.geeksforgeeks.org/dna-protein-python-3/

The program defines two functions: the first reads a sequence from a text file and returns the contents as a single string without line breaks; the second converts each DNA triple (codon) in that string into the single-letter code for the corresponding amino acid.

For the first function, we use read-line, which returns the text of one line with the line break removed, so our read-sequence-file function can append the lines together to make one complete string.
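
To see how read-line behaves, here is a minimal sketch that reads from a string port instead of a file. The helper name join-lines and the sample text are made up for illustration; open-input-string is part of (scheme base).

```scheme
(import (scheme base)
        (scheme write))

;; Hypothetical helper: append every line read from PORT into one string.
(define (join-lines port)
  (do ((line (read-line port) (read-line port))
       (sequence "" (string-append sequence line)))
    ((eof-object? line) sequence)))

;; A string port stands in for a file here; the text is sample data.
(define sample (open-input-string "ATGGCC\nATTGTA\nTAA\n"))

(display (join-lines sample))   ; prints ATGGCCATTGTATAA
(newline)
```

Each call to read-line returns one line without its line break, and an end-of-file object once nothing is left, which is what ends the loop.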

For the second function, we use an association list for the table. A hash table or map would also work, but association lists are built into R7RS-small, and with only 64 entries the table is small enough that lookup efficiency does not matter.
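
As a small illustration of the lookup, the sketch below uses a three-entry excerpt of the table (the name mini-table is hypothetical). Note that assoc compares keys with equal?, which is what makes string keys work; assq and assv compare with eq? and eqv?, which are not reliable for strings.

```scheme
(import (scheme base)
        (scheme write))

;; A three-entry excerpt of the codon table, for illustration only.
(define mini-table
  '(("ATG" . #\M) ("GCC" . #\A) ("TAA" . #\_)))

;; assoc returns the whole matching pair ...
(display (assoc "GCC" mini-table))
(newline)

;; ... so cdr extracts just the letter.
(display (cdr (assoc "GCC" mini-table)))
(newline)

;; A key that is not in the list yields #f rather than a pair.
(display (assoc "XYZ" mini-table))
(newline)
```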

Final Program

(import (scheme base)
        (scheme file)
        (scheme write))

(define (read-sequence-file filename)
  (with-input-from-file                                     ; 1
    filename
    (lambda ()
      (do ((line (read-line) (read-line))                   ; 2
           (sequence "" (string-append sequence line)))     ; 3
        ((eof-object? line) sequence)))))                   ; 4

(define (translate sequence)
  (let ((table                                              ; 5
          '(("ATA" . #\I) ("ATC" . #\I) ("ATT" . #\I) ("ATG" . #\M)
            ("ACA" . #\T) ("ACC" . #\T) ("ACG" . #\T) ("ACT" . #\T)
            ("AAC" . #\N) ("AAT" . #\N) ("AAA" . #\K) ("AAG" . #\K)
            ("AGC" . #\S) ("AGT" . #\S) ("AGA" . #\R) ("AGG" . #\R)
            ("CTA" . #\L) ("CTC" . #\L) ("CTG" . #\L) ("CTT" . #\L)
            ("CCA" . #\P) ("CCC" . #\P) ("CCG" . #\P) ("CCT" . #\P)
            ("CAC" . #\H) ("CAT" . #\H) ("CAA" . #\Q) ("CAG" . #\Q)
            ("CGA" . #\R) ("CGC" . #\R) ("CGG" . #\R) ("CGT" . #\R)
            ("GTA" . #\V) ("GTC" . #\V) ("GTG" . #\V) ("GTT" . #\V)
            ("GCA" . #\A) ("GCC" . #\A) ("GCG" . #\A) ("GCT" . #\A)
            ("GAC" . #\D) ("GAT" . #\D) ("GAA" . #\E) ("GAG" . #\E)
            ("GGA" . #\G) ("GGC" . #\G) ("GGG" . #\G) ("GGT" . #\G)
            ("TCA" . #\S) ("TCC" . #\S) ("TCG" . #\S) ("TCT" . #\S)
            ("TTC" . #\F) ("TTT" . #\F) ("TTA" . #\L) ("TTG" . #\L)
            ("TAC" . #\Y) ("TAT" . #\Y) ("TAA" . #\_) ("TAG" . #\_)
            ("TGC" . #\C) ("TGT" . #\C) ("TGA" . #\_) ("TGG" . #\W))))
    (do ((i 0 (+ i 3))                                      ; 6
         (result '() (cons (cdr (assoc (substring sequence i (+ i 3))
                                       table))              ; 7
                           result)))
      ((> (+ i 3) (string-length sequence))                 ; 8
       (list->string (reverse result))))))

(let* ((dna-sequence (read-sequence-file "dna_sequence.txt"))
       (protein-sequence (translate dna-sequence))
       (target-sequence (read-sequence-file "amino_acid_sequence.txt")))
  (display "Comparing translated with target: ")
  (display (equal? protein-sequence target-sequence))
  (newline))
1 Opens the named file and makes it the current input port while the lambda runs.
2 Reads each line from the current input port as a string.
3 Appends each line to the sequence read so far
4 … until the end of file is reached, when the accumulated sequence is returned.
5 The conversion table is stored in an association list.
6 An index steps through the string, one triple at a time
7 … looking up each codon in turn and recording its amino-acid letter.
8 When the string has been fully processed, the letters, which were collected in reverse order, are reversed and converted into the string to return.
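
The translate function calls cdr directly on the result of assoc, so a codon that is missing from the table (for example one containing N) raises an error. Below is a sketch of a more forgiving variant. The name translate-with, the #\? placeholder, and the choice to ignore an incomplete trailing codon are all assumptions made for illustration, not part of the original program.

```scheme
(import (scheme base)
        (scheme write))

;; Hypothetical variant of translate: TABLE is passed in, codons missing
;; from TABLE become #\?, and an incomplete trailing codon is ignored.
(define (translate-with table sequence)
  (let ((len (string-length sequence)))
    (do ((i 0 (+ i 3))
         (result '()
                 (let ((entry (assoc (substring sequence i (+ i 3)) table)))
                   (cons (if entry (cdr entry) #\?) result))))
      ((> (+ i 3) len)
       (list->string (reverse result))))))

;; Excerpt of the codon table, enough for the sample inputs below.
(define mini-table
  '(("ATG" . #\M) ("GCC" . #\A) ("TAA" . #\_)))

(display (translate-with mini-table "ATGGCCTAA"))   ; prints MA_
(newline)
(display (translate-with mini-table "ATGNNNGC"))    ; prints M?
(newline)
```

Passing the table as an argument keeps the sketch short; the full 64-entry table from the program above could be supplied in the same way.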