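*&color(,yellow){This page is under construction}; [#hf6160ea]
*Exercise 2: Phylogenetic Analysis in a Unix-like Environment Using Real Data [#p1e5cf33]
In this exercise we follow a published research paper and carry out the following steps:
+Download nucleotide sequence data from a DNA database
+Align the downloaded sequence data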



**Software Setup: Analysis Software for Windows [#rbc9a738]
All of the following programs have a Windows graphical interface and work well alongside Cygwin. They are used in this exercise, so download and install them beforehand.
-MESQUITE http://mesquiteproject.org/mesquite/mesquite.html
--JRE5: the Java environment required by Mesquite http://java.sun.com/javase/downloads/index_jdk5.jsp~
Download and install Java Runtime Environment (JRE) 5.0 (Update 17 as of January 2009).
-BioEdit http://www.mbio.ncsu.edu/BioEdit/bioedit.html

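**Choosing a Paper [#bb8e3860]
A good way to learn phylogenetic analysis is to take the data used in a published paper, analyze it exactly as the paper describes, and check whether you obtain the reported results. The same approach also lets you find out where a newly sequenced sample of your own falls in an already published phylogeny.~
In this exercise we use the data published in Ted R. Schultz and Sean G. Brady. 2008. "Major evolutionary transitions in ant agriculture". [[PNAS. 105(14): 5435-5440.>http://www.pnas.org/content/105/14/5435.full]] We will download the sequence data from GenBank and run a phylogenetic analysis.~
(This paper was chosen only because it is a recent one that includes divergence time estimation; feel free to use another paper of your choice.)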

**Preparing the Sequence Data: Downloading from GenBank by Accession Number [#w378e38b]
The paper chosen above analyzes about 90 samples and four regions, about 2,500 bp of sequence in total. [[Supporting information, Table S2>http://www.pnas.org/content/suppl/2008/03/24/0711024105.DCSupplemental/0711024105SI.pdf#nameddest=ST1]] of the paper lists the GenBank accession numbers for each gene region.~
The analysis we want to run uses all of these sequences concatenated into a single dataset about 2,500 bp long, so the downloaded sequences have to be %%%concatenated%%%.~
The data preparation procedure is as follows:
+Get the accession numbers for each gene region from the paper ([[Table S2>http://www.pnas.org/content/suppl/2008/03/24/0711024105.DCSupplemental/0711024105SI.pdf#nameddest=ST1]])
+Download the data from [[GenBank>http://www.ncbi.nlm.nih.gov/Genbank/index.html]]
+Build an alignment for each region
+Concatenate all of the alignments

Because the final goal here is a sequence in which %%%the data from several regions are concatenated%%%, a little extra care is needed.

**Major evolutionary transitions in ant agriculture [#ma7c1896]
-Ted R. Schultz and Sean G. Brady. 2008~
[[PNAS. 105(14): 5435-5440.>http://www.pnas.org/content/105/14/5435.full]]
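***Getting the Accession Numbers for Each Gene Region [#bf04d98e]
+Copy the data from [[Table S2>http://www.pnas.org/content/suppl/2008/03/24/0711024105.DCSupplemental/0711024105SI.pdf#nameddest=ST1]]
+Paste it into a text editor such as K2Editor and convert it to tab-separated text with regular-expression search and replace
--In [[Table S2>http://www.pnas.org/content/suppl/2008/03/24/0711024105.DCSupplemental/0711024105SI.pdf#nameddest=ST1]] the columns are separated by spaces, but spaces also occur inside cells, as in "no seq" and " sp.".
--First, replace the within-cell spaces that recur in the same pattern with underscores in one pass:
 no seq → no_seq
 cf.  → cf_
--Next, use a regular-expression replace to turn every run of one or more spaces into a tab:
  + → \t
 (there is one half-width space to the left of the + above)
+Paste the result into Excel and fix by hand the places where leftover spaces have shifted the columns
+Select the column for one gene region and paste it into the text editor
--Replace every line break with a comma, then delete ",no_seq":
 \n → ,
 ,no_seq → <leave empty>
+This gives the accession numbers for each gene region: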
---EF1aF1_e1: EU204345, EU204298, EU204378, EU204363, EU204364, EU204331, EU204360, EU204361, EU204323, EU204377, EU204348, EU204317, EU204347, EU204374, EU204349, EU204334, EU204318, EU204367, EU204335, EU204350, EU204314, EU204324, EU204359, EU204379, EU204315, EU204380, EU204313, EU204299, EU204328, EU204355, EU204366, EU204321, EU204320, EU204330, EU204342, EU204369, EU204354, EU204368, EU204365, EU204376, EU204346, EU204326, EU204351, EU204307, EU204371, EU204319, EU204358, EU204311, EU204370, EU204312, EU204341, EU204310, EU204357, EU204343, EU204305, EU204381, EU204375, EU204344, EU204373
---EF1aF1_e2: EU204436, EU204389, EF013211, EU204453, EU204454, EU204422, EU204450, EU204451, EU204414, EF013230, EU204439, EU204408, EU204438, EU204464, EU204440, EU204425, EU204409, EU204457, EU204426, EU204441, EU204405, EU204415, EU204449, EF013232, EU204406, EF013240, EU204404, EU204390, EU204419, EU204445, EU204456, EU204412, EU204411, EU204421, EU204433, EU204459, EU204458, EU204455, EF013251, EU204437, EU204417, EU204442, EU204398, EU204461, EU204410, EU204448, EU204402, EU204460, EU204403, EU204432, EU204401, EU204447, EU204434, EU204396, EF013296, EF013299, EU204435, EU204463
---EU204586, EU204541, EF013373, EU204604, EU204605, EU204573, EU204601, EU204602, EU204565, EF013392, EU204589, EU204559, EU204588, EU204615, EU204590, EU204576, EU204560, EU204608, EU204577, EU204591, EU204556, EU204566, EU204600, EF013394, EU204557, EF013402, EU204555, EU204570, EU204596, EU204607, EU204563, EU204562, EU204572, EU204583, EU204610, EU204595, EU204609, EU204606, EF013414, EU204587, EU204568, EU204592, EU204549, EU204612, EU204561, EU204599, EU204553, EU204611, EU204554, EU204582, EU204552, EU204598, EU204584, EU204547, EF013458, EF013461, EU204585, EU204614
---EU204511, EU204465, EF013534, EU204529, EU204530, EU204497, EU204526, EU204527, EU204490, EF013549, EU204514, EU204484, EU204513, EU204540, EU204515, EU204500, EU204485, EU204533, EU204501, EU204516, EU204481, EU204491, EU204525, EF013551, EU204482, EF013558, EU204480, EU204466, EU204494, EU204521, EU204532, EU204488, EU204487, EU204496, EU204508, EU204535, EU204520, EU204534, EU204531, EF013565, EU204512, EU204517, EU204474, EU204537, EU204486, EU204524, EU204478, EU204536, EU204479, EU204507, EU204477, EU204523, EU204509, EU204472, EF013598, EF013600, EU204510, EU204539
---EU204268, EU204222, EF013534, EU204286, EU204287, EU204254, EU204283, EU204284, EU204247, EF013549, EU204271, EU204241, EU204270, EU204297, EU204272, EU204257, EU204242, EU204290, EU204258, EU204273, EU204238, EU204248, EU204282, EF013551, EU204239, EF013558, EU204237, EU204223, EU204251, EU204278, EU204289, EU204245, EU204244, EU204253, EU204265, EU204292, EU204277, EU204291, EU204288, EF013565, EU204269, EU204274, EU204231, EU204294, EU204243, EU204281, EU204235, EU204293, EU204236, EU204264, EU204234, EU204280, EU204266, EU204229, EF013598, EF013600, EU204267, EU204296
---EU204192, EU204145, EF013662, EU204210, EU204211, EU204178, EU204207, EU204208, EU204170, EF013677, EU204195, EU204164, EU204194, EU204221, EU204196, EU204181, EU204165, EU204214, EU204182, EU204197, EU204161, EU204171, EU204206, EF013679, EU204162, EF013686, EU204160, EU204146, EU204175, EU204202, EU204213, EU204168, EU204167, EU204177, EU204189, EU204216, EU204201, EU204215, EU204212, EF013693, EU204193, EU204173, EU204198, EU204154, EU204218, EU204166, EU204205, EU204158, EU204217, EU204159, EU204188, EU204157, EU204204, EU204190, EU204152, EF013726, EF013728, EU204191, EU204220
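
The editor and Excel steps above could also be scripted. The following is only a minimal sketch, not the procedure used on this page. The function names are hypothetical, the taxon labels in the usage example are made up (the accession numbers are taken from the lists above), and only the "no seq" and "cf. " cases from the text are protected. Multi-word taxon names would still need the manual column fixes described above. The last function builds a request URL for the NCBI E-utilities `efetch` service, which returns the listed records as FASTA text.

```python
import re
from urllib.parse import urlencode

# Within-cell spaces to protect before splitting columns.
# Only these two patterns come from the text above; extend as needed.
PROTECT = [("no seq", "no_seq"), ("cf. ", "cf_")]


def to_tsv(text):
    """Turn space-separated table text into tab-separated text."""
    for old, new in PROTECT:
        text = text.replace(old, new)
    # Every run of one or more spaces becomes a single tab.
    return re.sub(r" +", "\t", text)


def column_accessions(tsv, col):
    """Collect one column (0-based), dropping the no_seq placeholders."""
    accessions = []
    for row in tsv.splitlines():
        cells = row.split("\t")
        if col < len(cells) and cells[col] != "no_seq":
            accessions.append(cells[col])
    return accessions


def efetch_url(accessions):
    """Build an NCBI E-utilities efetch URL that returns FASTA text."""
    query = urlencode({
        "db": "nuccore",
        "id": ",".join(accessions),
        "rettype": "fasta",
        "retmode": "text",
    })
    return "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?" + query


# Usage with two made-up rows (taxon label, then two gene columns).
sample = "Taxon_A EU204345 EU204436\nTaxon_B no seq EU204389\n"
tsv = to_tsv(sample)
print(column_accessions(tsv, 1))
print(efetch_url(column_accessions(tsv, 2)))
```

The resulting URL can be opened in a browser or fetched with `urllib.request.urlopen`. For long lists it is safer to request the accession numbers in smaller batches.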





***Data [#l57a712c]
-2,459 aligned nucleotide sites from the coding regions of four nuclear genes:
--elongation factor 1α-F1 (EF1α-F1) (1,075 bp)
--elongation factor 1α-F2 (EF1α-F2) (517 bp)
--wingless (409 bp)
--long-wavelength rhodopsin (opsin) (458 bp)
-All data in this study represent protein-coding (exon) sequences;~
intervening introns in opsin and EF1α-F1 were not used because they could not be aligned confidently.
-Sample: 65 attine taxa and 26 nonattine outgroups.~
Primers used for PCR amplification and sequencing are found in supporting information (SI) Table S1.
-Of the total 2,459 included nucleotide positions from all genes, 952 were variable and 847 parsimony informative. Sequences are deposited in GenBank; taxa and accession numbers are listed in Table S2.
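
The four gene lengths listed above add up to the 2,459 sites of the combined dataset, which is what concatenating the per-gene alignments should produce. The following is a minimal sketch of that concatenation step, assuming each alignment is held as a dictionary from taxon name to aligned sequence. The function name, the ASCII gene labels, and the choice of `?` for taxa missing from a gene are assumptions for illustration, not something taken from the paper.

```python
# Aligned lengths per gene, as listed in the Data section above.
GENE_LENGTHS = {"EF1aF1": 1075, "EF1aF2": 517, "wingless": 409, "opsin": 458}


def concatenate(alignments, missing="?"):
    """Join per-gene alignments taxon by taxon.

    alignments: list of dicts mapping taxon name -> aligned sequence.
    A taxon absent from one gene is padded with `missing` characters
    so that every concatenated sequence has the same length.
    """
    taxa = sorted(set().union(*alignments))
    parts = {taxon: [] for taxon in taxa}
    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences in one alignment differ in length")
        width = lengths.pop()
        for taxon in taxa:
            parts[taxon].append(aln.get(taxon, missing * width))
    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}


# Toy example: taxon t2 has no sequence for the second gene.
gene_a = {"t1": "ACGT", "t2": "AC-T"}
gene_b = {"t1": "GG"}
combined = concatenate([gene_a, gene_b])
print(combined)
print(sum(GENE_LENGTHS.values()))
```

Padding with `?` keeps the matrix rectangular. Whether missing data should be coded as `?` or `-` depends on the analysis program.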

***Phylogenetic Analyses [#u92e7045]
-(i) Maximum parsimony (MP) analyses
--PAUP* v4.0b10
---heuristic searches with tree bisection-reconnection (TBR) and 1,000 random-taxon-addition replicates.~
Analyses identified 12 most-parsimonious trees (MPTs) of length = 4,383, CI = 0.270, RI = 0.704. Successive-approximations-weighting analyses identified a single tree, one of the MPTs.
---Nonparametric bootstrap analyses used TBR branch-swapping and consisted of 1,000 pseudoreplicates, with 10 random-taxon-addition replicates per pseudoreplicate. 
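
For reference, a nonparametric bootstrap pseudoreplicate is built by resampling the alignment columns with replacement until the replicate has as many sites as the original. PAUP* does this internally; the sketch below only illustrates the resampling itself. The function name and the dictionary representation of the alignment are assumptions.

```python
import random


def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement (one pseudoreplicate).

    alignment: dict mapping taxon name -> aligned sequence (equal lengths).
    rng: a random.Random instance, so that runs can be reproduced.
    """
    n_sites = len(next(iter(alignment.values())))
    # Draw n_sites column indices, each chosen independently.
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {
        taxon: "".join(seq[i] for i in columns)
        for taxon, seq in alignment.items()
    }


toy = {"t1": "ACGT", "t2": "TGCA"}
replicate = bootstrap_replicate(toy, random.Random(0))
print(replicate)
```

The same column indices are used for every taxon, so each site in the replicate is still a complete column of the original alignment.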

-(ii) Maximum likelihood (ML)
--ModelTest v3.06~
The data and the MPT identified by successive-approximations weighting were evaluated under the Akaike information criterion (AIC) as calculated in ModelTest,
---identifying the GTR model of evolution.
--GARLI v0.951 using the GTR model (with six Γ rate categories), giving a -log likelihood of 24,868.84927.
---Nonparametric bootstrap analyses consisted of 500 pseudoreplicates in GARLI under the same conditions as the ML search.
---A subsequent heuristic search in PAUP* using the most likely tree identified by the GARLI searches as the starting tree and employing TBR branch-swapping and the GTR+I+Γ model (with six Γ rate categories) resulted in exactly the same topology and likelihood score.

-(iii) Bayesian nucleotide-model Markov Chain Monte Carlo (MCMC):~
MrBayes v3.1.2 (59). 
--Burn-in and run convergence were assessed by comparing the mean and variance of log likelihoods, both by eye and by using:
---the program Tracer v1.3
---the MrBayes ".stat" output file
---the MrBayes split frequencies diagnostic.
--Eight character partitions for nucleotide-model analyses:
---four partitions consisting of the combined first and second codon positions for each of the four genes
---four partitions consisting of the third codon position for each of the four genes.
---Models assigned on the basis of ModelTest results:~
wingless third positions: GTR model~
opsin and EF1α-F2 third positions: HKY+I+Γ model, assigned separately~
all other character partitions: GTR+I+Γ model, assigned separately
-(iv) Bayesian codon-model MCMC
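
The eight character partitions described under (iii) can be written as NEXUS-style character sets once the position of each gene in the concatenated alignment is known. The sketch below rests on two assumptions that are not stated in the source: that the genes are concatenated in the order listed in the Data section, and that each gene block starts at codon position 1. If the reading frame of a block is shifted, the start offsets have to be adjusted. The function and partition names are hypothetical.

```python
# (gene label, aligned length); the concatenation order is an assumption.
GENES = [("EF1aF1", 1075), ("EF1aF2", 517), ("wingless", 409), ("opsin", 458)]


def codon_partitions(genes):
    """Return (name, site range) pairs: positions 1+2 and 3 for each gene.

    Ranges use the NEXUS 'start-end\\3' notation (every third site).
    Assumes every gene block begins at codon position 1.
    """
    partitions = []
    start = 1
    for name, length in genes:
        end = start + length - 1
        pos12 = f"{start}-{end}\\3 {start + 1}-{end}\\3"
        pos3 = f"{start + 2}-{end}\\3"
        partitions.append((name + "_pos12", pos12))
        partitions.append((name + "_pos3", pos3))
        start = end + 1
    return partitions


parts = codon_partitions(GENES)
for name, sites in parts:
    print(f"charset {name} = {sites};")
```

The printed lines can be pasted into a sets block or a MrBayes block and then checked against the actual reading frame of each gene.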

**Phylogenetic Mapping of Agricultural Systems. [#m346c506]
//-Terminal taxa were assigned states for a single six-state character representing the four attine agricultural systems and leaf-cutter agriculture (i.e., no agriculture, lower agriculture, yeast agriculture, higher agriculture, leaf-cutter agriculture, coral-fungus agriculture).
//-Five species (Myrmicocrypta n. sp. Brazil, Mycetagroicus triangularis, Cyphomyrmex n. sp., Cyphomyrmex morschi, Trachymyrmex irmgardae, and Pseudoatta n. sp.) received "unknown" (i.e., "?") state assignments, and Trachymyrmex papulatus received a "lower agriculture" state assignment based on a single garden collection from Argentina (a second colony from the same locality cultivated a typical higher attine garden).
//-Character evolution was optimized onto the Bayesian codon-model consensus tree (with branch lengths) under both parsimony using MacClade and maximum likelihood using the StochChar module provided in the Mesquite package.
//-Under parsimony, ancestral-state optimizations were unambiguous. 
//-Under the Markov k-state 1-parameter model, the likelihood that each agricultural system arose in the most recent common ancestor of the corresponding ant clade was, as a proportion of the total probability (= 1.0) distributed across the six character states, 0.9831 for lower agriculture, 0.9995 for yeast agriculture, 0.9905 for higher agriculture, 0.9924 for leaf-cutter agriculture, and 0.9998 for coral-fungus agriculture.

**Divergence Dating [#gae0451f]
//-We inferred divergence dates using both semiparametric and Bayesian relaxed clock methods.
//-The first method used was the semiparametric penalized likelihood approach implemented in r8s v1.7 (64, 65). 
//--Branch lengths were first estimated on the ML topology using PAUP* under a GTR+I+Γ model.
//--The Pogonomyrmex and two Myrmica species were used to root the tree during branch length estimation and were subsequently removed from all dating analyses.
//--Thus, the root of the tree for all dating analyses represents the origin of the "core myrmicines," a well supported clade established by previous work (33).
//--Smoothing parameters were estimated by using the cross-validation feature in r8s. 
//--Confidence intervals were calculated by using 100 nonparametric bootstrap replicates of the dataset generated by Mesquite, followed by reestimation of branch lengths and divergence times for each replicate.
//-We calibrated three nodes with minimum-age constraints using attine Dominican amber fossils. 
//--These fossils are:
//---(i) Apterostigma electropilosum, a member of the A. pilosum group 
//---(ii) Cyphomyrmex maya and Cyphomyrmex taino, both members of the C. rimosus group 
//---(iii) Trachymyrmex primaevus, a fossil of uncertain placement within the genus (but see below).