Data Analysis

DNA sequence

Useful Regular Expressions

  1. Sometimes you want to download the sequences published in a paper. For example, consider this paper: Ted R. Schultz and Sean G. Brady. 2008. "Major evolutionary transitions in ant agriculture". PNAS 105(14): 5435-5440.
  2. To download the sequences from GenBank, first prepare a list of their accession numbers.
    For this paper, the accession numbers are listed in Table S2 of the supplementary information.
  3. To obtain these sequences from GenBank, you need the accession numbers as a comma-delimited list:
    Accession number 1, Accession number 2, ....
    • To produce this list, follow the workflow below in a text editor.
  4. In Notepad++, use search/replace in regular expression mode.
    • First, replace patterns that appear repeatedly and contain spaces or dots. Replace those spaces and dots with other characters so that each pattern stays in one piece when spaces are converted to tabs:
      no seq → no_seq
      cf.  → cf_
    • Replace each run of white space with a tab:
       +  → \t
      ↑ the search pattern starts with a <white space> here
    • Copy everything and paste it into Excel; the accession numbers will then be arranged in columns.
    • Copy the column you need and paste it into Notepad++.
    • Replace each end of line with a comma:
       \r?\n → ,
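
The same workflow can also be sketched in Python, which may be convenient when the table is long. This is only a minimal sketch under stated assumptions: the rows, taxon labels and accession numbers below are made-up placeholders rather than data from the paper, the function name `accession_list` is hypothetical, and skipping `no_seq` entries is an added assumption that the steps above do not state.

```python
import re

# Made-up rows standing in for text copied from a supplementary table.
# The labels and accession numbers are placeholders, not real data.
ROWS = [
    "sp1  AB000001  AB000002",
    "cf. sp2  no seq  AB000004",
]


def accession_list(rows, column):
    """Return the values of one column as a comma-delimited string."""
    values = []
    for row in rows:
        # Protect patterns that contain spaces before splitting on spaces.
        row = row.replace("no seq", "no_seq")
        row = row.replace("cf. ", "cf_")
        # One or more spaces become a single tab, as in the " +" -> "\t" step.
        fields = re.sub(r" +", "\t", row.strip()).split("\t")
        # Assumption: entries without a sequence are left out of the list.
        if fields[column] != "no_seq":
            values.append(fields[column])
    # Joining with commas replaces the final "end of line to comma" step.
    return ",".join(values)


if __name__ == "__main__":
    print(accession_list(ROWS, 2))
```

Columns are counted from 0, so `accession_list(ROWS, 2)` reads the third column of each row.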