Data Analysis

DNA sequence

Useful Regular Expressions

  1. Sometimes you want to download the sequences published in a paper. For example, consider this paper: Ted R. Schultz and Sean G. Brady. 2008. "Major evolutionary transitions in ant agriculture." PNAS 105(14): 5435-5440.
  2. To download the sequences from GenBank, first prepare a list of accession numbers.
    In this paper, the accession numbers are listed in Table S2 of the supplementary information.
  3. To obtain these sequences from GenBank, you need the accession numbers in comma-delimited form:
    Accession number 1, Accession number 2, ....
    • To produce this list, follow the workflow below in a text editor.
  4. Using jEdit, do a search/replace with regular expressions.
    • First, replace patterns that appear repeatedly (spaces and dots inside them should be replaced with other characters):
      no seq → no_seq
      cf.  → cf_
    • Replace runs of multiple white spaces with a single delimiter character.
    • Change the file name into
       species name with accession
      using the replace command with a regular expression.
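
The search/replace steps above can also be sketched outside the editor. The following Python snippet is a minimal illustration using the standard `re` module. The function name `clean_row`, the choice of a tab as the replacement for multiple white spaces, and the sample row (including the accession `AB123456`) are assumptions made for illustration, not content taken from Table S2.

```python
import re

def clean_row(line):
    """Apply the clean-up substitutions to one row of copied table text."""
    # Join repeated multi-word patterns first so later steps do not split them.
    line = re.sub(r"\bno seq\b", "no_seq", line)
    # "cf." and any white space after it become "cf_".
    line = re.sub(r"\bcf\.\s*", "cf_", line)
    # Collapse runs of two or more white-space characters into one tab
    # (the tab is an assumed delimiter; any single character would do).
    line = re.sub(r"\s{2,}", "\t", line.strip())
    return line

# Hypothetical row, shaped like text copied from a PDF table.
row = "Genus cf.  species    AB123456   no seq"
print(clean_row(row))
```

The order matters: the multi-word patterns are joined before white space is collapsed, so "no seq" is not mistaken for two separate columns.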
  5. Copy the accession numbers from Table S2, convert them into a comma-delimited list, and search GenBank with that list. Save the sequences as FASTA-formatted files.
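
As a rough sketch of how the comma-delimited list could be built from the copied table text, the snippet below pulls out everything that looks like an accession number and joins the matches. The pattern (one or two capital letters followed by five or six digits) is an assumption about the accession format and may need adjusting. The name `accession_query` and the sample rows are hypothetical.

```python
import re

# Assumed accession shape: one or two capital letters, then five or six digits.
ACCESSION = re.compile(r"\b[A-Z]{1,2}\d{5,6}\b")

def accession_query(text):
    """Return the unique accession numbers in text as one comma-delimited string."""
    seen = []
    for acc in ACCESSION.findall(text):
        if acc not in seen:  # keep first occurrence, preserve table order
            seen.append(acc)
    return ", ".join(seen)

# Hypothetical rows standing in for cleaned-up table text.
table_text = """
Genus_a species_a   AB123456   XY654321   no_seq
Genus_b cf_species_b   AB123457   no_seq   U12345
"""
print(accession_query(table_text))
```

The resulting string can be pasted into the GenBank search box in place of a hand-edited list.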
  6. The resulting FASTA file can be aligned with ClustalW.
    However, because the length of sample names is limited, it is better to edit the sample names first.
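
One possible way to shorten the sample names before alignment is sketched below. It assumes each FASTA header has the layout `>ACCESSION.VERSION Genus species ...` and rewrites it as the species name followed by the accession. The 30-character default for `max_len` is an assumption, so check the limit of the ClustalW version in use. The function names and the sample record are hypothetical.

```python
import re

def short_name(header, max_len=30):
    """Build 'Genus_species_ACCESSION' from a FASTA header, at most max_len long."""
    # Assumed header layout: ">ACCESSION.VERSION Genus species description..."
    fields = header.lstrip(">").split()
    accession = fields[0].split(".")[0]
    species = "_".join(fields[1:3])
    # Keep only letters, digits, and underscores in the species part.
    species = re.sub(r"[^A-Za-z0-9_]", "_", species)
    # Truncate the species part so the accession is never cut off.
    species = species[: max_len - len(accession) - 1]
    return f"{species}_{accession}"

def rename_fasta(lines, max_len=30):
    """Yield FASTA lines with each header replaced by its shortened name."""
    for line in lines:
        if line.startswith(">"):
            yield ">" + short_name(line.strip(), max_len)
        else:
            yield line.rstrip("\n")

# Hypothetical record; real input would be the lines of the downloaded file.
fasta = [
    ">AB123456.1 Genus species gene, partial sequence\n",
    "ACGTACGTAC\n",
]
print("\n".join(rename_fasta(fasta)))
```

Truncating only the species part keeps the accession intact, so every shortened name can still be traced back to its GenBank record.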