SPARQLthon28/Cufflinks2RDF
From TogoWiki
Purpose
Design an RDF data model for converting Cufflinks output, as part of the large-scale RNA-Seq analysis data flow of MicrobeDB.jp, the integrated microbial database.
Cufflinks input files
- BAM
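- GTF (converted from Genome-RDF)
- seqid matches the FASTA header line confirmed at SPARQLthon27
- feature type: in microbial genomes a CDS is treated as an exon (CDS = exon), and start_codon and stop_codon records were added mechanically
- gene_id uses the locus_tag; transcript_id is "transcript_" + gene_id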
refseq:NC_000913.3 RefSeq exon 190 255 . + . gene_id "b0001"; transcript_id "transcript_b0001";
refseq:NC_000913.3 RefSeq CDS 190 252 . + . gene_id "b0001"; transcript_id "transcript_b0001";
refseq:NC_000913.3 RefSeq start_codon 253 255 . + . gene_id "b0001"; transcript_id "transcript_b0001";
refseq:NC_000913.3 RefSeq stop_codon 190 192 . + . gene_id "b0001"; transcript_id "transcript_b0001";
refseq:NC_000913.3 RefSeq exon 337 2799 . + . gene_id "b0002"; transcript_id "transcript_b0002";
refseq:NC_000913.3 RefSeq CDS 337 2796 . + . gene_id "b0002"; transcript_id "transcript_b0002";
refseq:NC_000913.3 RefSeq start_codon 2797 2799 . + . gene_id "b0002"; transcript_id "transcript_b0002";
refseq:NC_000913.3 RefSeq stop_codon 337 339 . + . gene_id "b0002"; transcript_id "transcript_b0002";
refseq:NC_000913.3 RefSeq exon 2801 3733 . + . gene_id "b0003"; transcript_id "transcript_b0003";
refseq:NC_000913.3 RefSeq CDS 2801 3730 . + . gene_id "b0003"; transcript_id "transcript_b0003";
refseq:NC_000913.3 RefSeq start_codon 3731 3733 . + . gene_id "b0003"; transcript_id "transcript_b0003";
refseq:NC_000913.3 RefSeq stop_codon 2801 2803 . + . gene_id "b0003"; transcript_id "transcript_b0003";
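The naming rules for the input GTF can be sketched in Python as follows. This is only a minimal illustration, not the script actually used at the hackathon: the function names `gtf_attributes` and `gtf_line` are hypothetical, and the score and frame columns are simply left as "." as in the sample records.

```python
def gtf_attributes(locus_tag):
    """Build the GTF attribute column: gene_id is the locus_tag,
    transcript_id is "transcript_" + gene_id."""
    gene_id = locus_tag
    transcript_id = "transcript_" + gene_id
    return f'gene_id "{gene_id}"; transcript_id "{transcript_id}";'


def gtf_line(seqid, feature, start, end, strand, locus_tag, source="RefSeq"):
    """Return one tab-separated GTF record (hypothetical helper).

    Score (column 6) and frame (column 8) are left as "." here.
    """
    columns = [
        seqid, source, feature, str(start), str(end),
        ".", strand, ".", gtf_attributes(locus_tag),
    ]
    return "\t".join(columns)


if __name__ == "__main__":
    # Reproduces the exon record of b0001 from the sample.
    print(gtf_line("refseq:NC_000913.3", "exon", 190, 255, "+", "b0001"))
```

Deriving both identifiers from the locus_tag keeps the gene_id in the Cufflinks output directly joinable with the gene entries in Genome-RDF.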
Cufflinks output files
Results of mapping reads to the E. coli K-12 MG1655 genome with Bowtie2, converting the alignments to BAM with SAMtools, and computing FPKM with Cufflinks using the GTF file.
- genes.fpkm_tracking
tracking_id class_code nearest_ref_id gene_id gene_short_name tss_id locus length coverage FPKM FPKM_conf_lo FPKM_conf_hi FPKM_status
b0001 - - b0001 - - refseq:NC_000913.3:189-255 - - 2178.07 531.605 3824.54 OK
b0002 - - b0002 - - refseq:NC_000913.3:336-2799 - - 29.5047 26.376 32.6333 OK
b0003 - - b0003 - - refseq:NC_000913.3:2800-3733 - - 36.1379 30.0939 42.1819 OK
b0004 - - b0004 - - refseq:NC_000913.3:3733-5020 - - 31.5031 26.8051 36.2011 OK
b0005 - - b0005 - - refseq:NC_000913.3:5233-5530 - - 33.4489 17.6705 49.2272 OK
- transcripts.gtf
refseq:NC_000913.3 Cufflinks transcript 190 255 1000 + . gene_id "b0001"; transcript_id "transcript_b0001"; FPKM "2178.0736521866"; frac "1.000000"; conf_lo "531.604732"; conf_hi "3824.542572"; cov "501.695154";
refseq:NC_000913.3 Cufflinks exon 190 255 1000 + . gene_id "b0001"; transcript_id "transcript_b0001"; exon_number "1"; FPKM "2178.0736521866"; frac "1.000000"; conf_lo "531.604732"; conf_hi "3824.542572"; cov "501.695154";
refseq:NC_000913.3 Cufflinks transcript 337 2799 1000 + . gene_id "b0002"; transcript_id "transcript_b0002"; FPKM "29.5046541329"; frac "1.000000"; conf_lo "26.375989"; conf_hi "32.633319"; cov "11.383029";
refseq:NC_000913.3 Cufflinks exon 337 2799 1000 + . gene_id "b0002"; transcript_id "transcript_b0002"; exon_number "1"; FPKM "29.5046541329"; frac "1.000000"; conf_lo "26.375989"; conf_hi "32.633319"; cov "11.383029";
refseq:NC_000913.3 Cufflinks transcript 2801 3733 1000 + . gene_id "b0003"; transcript_id "transcript_b0003"; FPKM "36.1378547916"; frac "1.000000"; conf_lo "30.093856"; conf_hi "42.181853"; cov "13.894875";
refseq:NC_000913.3 Cufflinks exon 2801 3733 1000 + . gene_id "b0003"; transcript_id "transcript_b0003"; exon_number "1"; FPKM "36.1378547916"; frac "1.000000"; conf_lo "30.093856"; conf_hi "42.181853"; cov "13.894875";
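As a rough illustration of how genes.fpkm_tracking could be read, the Python sketch below parses the tab-delimited file with the standard csv module. The function names `parse_locus` and `read_fpkm_tracking` and the output dictionary keys are assumptions made for this example. In the sample, the locus start of b0001 (189) is one less than its GTF start (190), so the sketch keeps the locus coordinates exactly as written instead of assuming a coordinate convention.

```python
import csv
import io

HEADER = ("tracking_id class_code nearest_ref_id gene_id gene_short_name tss_id "
          "locus length coverage FPKM FPKM_conf_lo FPKM_conf_hi FPKM_status")
ROW = "b0001 - - b0001 - - refseq:NC_000913.3:189-255 - - 2178.07 531.605 3824.54 OK"
# The real file is tab-separated, so the sample is rebuilt with tabs here.
SAMPLE = "\n".join("\t".join(line.split()) for line in (HEADER, ROW)) + "\n"


def parse_locus(locus):
    """Split 'seqid:start-end'; the seqid itself may contain ':'."""
    seqid, span = locus.rsplit(":", 1)
    start, end = span.split("-")
    return seqid, int(start), int(end)


def read_fpkm_tracking(handle):
    """Yield one dict per gene from a genes.fpkm_tracking file handle."""
    for row in csv.DictReader(handle, delimiter="\t"):
        seqid, start, end = parse_locus(row["locus"])
        yield {
            "gene_id": row["gene_id"],
            "seqid": seqid,
            "start": start,   # kept as written in the locus column
            "end": end,
            "fpkm": float(row["FPKM"]),
            "conf_lo": float(row["FPKM_conf_lo"]),
            "conf_hi": float(row["FPKM_conf_hi"]),
            "status": row["FPKM_status"],
        }


if __name__ == "__main__":
    for gene in read_fpkm_tracking(io.StringIO(SAMPLE)):
        print(gene)
```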
Information to convert to RDF
- gene_id (genes.fpkm_tracking)
- FPKM (genes.fpkm_tracking)
- FPKM_conf_lo (genes.fpkm_tracking)
- FPKM_conf_hi (genes.fpkm_tracking)
- FPKM_status (genes.fpkm_tracking)
- Metadata
- Reference genome sequence: bioproject=PRJNA57659, refseq=NC_000913.3
- Mapped read data: sra=SRA141679, srx=SRX474167
- Analysis method: "Map with Bowtie2, convert to BAM with SAMtools, then compute FPKM with Cufflinks using the GTF file"
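Confirmed points
- The Cufflinks input GTF was itself generated from Genome-RDF, so RDF for the gene annotation already exists; the conversion therefore takes genes.fpkm_tracking as its input
Survey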
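To make the intended conversion concrete, here is a minimal Python sketch that turns one genes.fpkm_tracking row into Turtle. The data model itself is still to be designed, so everything about the vocabulary is a placeholder: the `ex:` prefix, the `http://example.org/` URIs, the class `ex:GeneExpression`, and the function name `row_to_turtle` are all hypothetical and only mirror the field names listed above.

```python
# Placeholder prefixes; "ex:" is NOT an agreed vocabulary.
PREFIXES = (
    "@prefix ex: <http://example.org/cufflinks/> .\n"
    "@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .\n"
)


def row_to_turtle(row, srx, refseq, bioproject):
    """Serialize one genes.fpkm_tracking row (a dict of strings) as Turtle."""
    gene_id = row["gene_id"]
    # Hypothetical subject URI: one expression record per (experiment, gene).
    subject = f"<http://example.org/expression/{srx}/{gene_id}>"
    statements = [
        f"{subject} a ex:GeneExpression ;",
        f'    ex:gene_id "{gene_id}" ;',
        f'    ex:FPKM "{row["FPKM"]}"^^xsd:double ;',
        f'    ex:FPKM_conf_lo "{row["FPKM_conf_lo"]}"^^xsd:double ;',
        f'    ex:FPKM_conf_hi "{row["FPKM_conf_hi"]}"^^xsd:double ;',
        f'    ex:FPKM_status "{row["FPKM_status"]}" ;',
        f'    ex:bioproject "{bioproject}" ;',
        f'    ex:refseq "{refseq}" ;',
        f'    ex:srx "{srx}" .',
    ]
    return "\n".join(statements)


if __name__ == "__main__":
    sample_row = {
        "gene_id": "b0001",
        "FPKM": "2178.07",
        "FPKM_conf_lo": "531.605",
        "FPKM_conf_hi": "3824.54",
        "FPKM_status": "OK",
    }
    print(PREFIXES)
    print(row_to_turtle(sample_row, srx="SRX474167",
                        refseq="NC_000913.3", bioproject="PRJNA57659"))
```

The numeric values are passed through as the strings found in the file and written as typed literals, which avoids any change in precision during conversion. In a real model the gene_id would more likely link to the corresponding gene resource in Genome-RDF than be stored as a plain string.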
Cufflinks-rdf@bh12
A prototype was built at BioHackathon 2012: https://github.com/dbcls/bh12/wiki/Cufflinks-rdf
RDF model
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ns0: <http://purl.obolibrary.org/obo/> .
@prefix gtf: <http://genome.db/gtf/> .
@prefix ngs: <http://genome.db/ngs/> .

<http://genome.db/ensembl/ENST00000417324> rdf:type ns0:SO_0000833 .
<http://genome.db/ensembl/ENST00000417324> rdfs:label "ENST00000417324" .
<http://genome.db/ensembl/ENST00000417324> gtf:parent_gene <http://genome.db/ensembl/ENSG00000237613> .
<http://genome.db/ensembl/ENST00000417324> gtf:uuid gtf:uuid-06555be2-3bca-40ff-84ac-ca641f88320e .
<http://genome.db/sample/SQ_0081> ngs:hasSample gtf:uuid-06555be2-3bca-40ff-84ac-ca641f88320e .
<http://genome.db/project/Naive_T0> ngs:hasProject gtf:uuid-06555be2-3bca-40ff-84ac-ca641f88320e .
<http://genome.db/run/110908_H125_0119_AB01W2ABXX> ngs:hasRun gtf:uuid-06555be2-3bca-40ff-84ac-ca641f88320e .
gtf:uuid-06555be2-3bca-40ff-84ac-ca641f88320e ngs:sample <http://genome.db/sample/SQ_0081> .
gtf:uuid-06555be2-3bca-40ff-84ac-ca641f88320e ngs:project <http://genome.db/project/Naive_T0> .
gtf:uuid-06555be2-3bca-40ff-84ac-ca641f88320e ngs:run <http://genome.db/run/110908_H125_0119_AB01W2ABXX> .
<http://genome.db/ensembl/ENST00000417324> gtf:location <http://genome.db/coords/1:34554_35174-35277_35481-35721_36081:r> .
<http://genome.db/coords/1:34554_35174-35277_35481-35721_36081:r> a gtf:cufflinks_transcript .
<http://genome.db/coords/1:34554_35174-35277_35481-35721_36081:r> gtf:seqname "1" .
<http://genome.db/coords/1:34554_35174-35277_35481-35721_36081:r> gtf:start 34554 .
<http://genome.db/coords/1:34554_35174-35277_35481-35721_36081:r> gtf:stop 36081 .
<http://genome.db/coords/1:34554_35174-35277_35481-35721_36081:r> gtf:strand "-" .
gtf:uuid-06555be2-3bca-40ff-84ac-ca641f88320e gtf:FPKM 1.5 .
gtf:uuid-06555be2-3bca-40ff-84ac-ca641f88320e gtf:frac 0.0 .
gtf:uuid-06555be2-3bca-40ff-84ac-ca641f88320e gtf:conf_lo 0.0 .
gtf:uuid-06555be2-3bca-40ff-84ac-ca641f88320e gtf:conf_hi 0.0 .
gtf:uuid-06555be2-3bca-40ff-84ac-ca641f88320e gtf:cov 0.0 .
gtf:uuid-06555be2-3bca-40ff-84ac-ca641f88320e gtf:full_read_support "no" .
gtf:uuid-6c679944-1cb0-4891-905a-fbbf507c2ec9 rdf:type ns0:SO_0000852 .
gtf:uuid-6c679944-1cb0-4891-905a-fbbf507c2ec9 gtf:parent_transcript gtf:uuid-06555be2-3bca-40ff-84ac-ca641f88320e .
gtf:uuid-6c679944-1cb0-4891-905a-fbbf507c2ec9 gtf:location <http://genome.db/coords/1:34554_35174> .
<http://genome.db/coords/1:34554_35174> a gtf:caffulinks_exon .
<http://genome.db/coords/1:34554_35174> gtf:seqname "1" .
<http://genome.db/coords/1:34554_35174> gtf:start 34554 .
<http://genome.db/coords/1:34554_35174> gtf:stop 35174 .
<http://genome.db/coords/1:34554_35174> gtf:strand "-" .
<http://genome.db/coords/1:34554_35174> gtf:exon_number 1 .
gtf:uuid-74fdb480-0b2a-4439-8050-f8ea508b30bf rdf:type ns0:SO_0000852 .
gtf:uuid-74fdb480-0b2a-4439-8050-f8ea508b30bf gtf:parent_transcript gtf:uuid-06555be2-3bca-40ff-84ac-ca641f88320e .
gtf:uuid-74fdb480-0b2a-4439-8050-f8ea508b30bf gtf:location <http://genome.db/coords/1:35277_35481> .
<http://genome.db/coords/1:35277_35481> a gtf:caffulinks_exon .
<http://genome.db/coords/1:35277_35481> gtf:seqname "1" .
<http://genome.db/coords/1:35277_35481> gtf:start 35277 .
<http://genome.db/coords/1:35277_35481> gtf:stop 35481 .
<http://genome.db/coords/1:35277_35481> gtf:strand "-" .
<http://genome.db/coords/1:35277_35481> gtf:exon_number 2 .
gtf:uuid-62592261-26e6-4ac5-81b0-b5c4e73d87ad rdf:type ns0:SO_0000852 .
gtf:uuid-62592261-26e6-4ac5-81b0-b5c4e73d87ad gtf:parent_transcript gtf:uuid-06555be2-3bca-40ff-84ac-ca641f88320e .
gtf:uuid-62592261-26e6-4ac5-81b0-b5c4e73d87ad gtf:location <http://genome.db/coords/1:35721_36081> .
<http://genome.db/coords/1:35721_36081> a gtf:caffulinks_exon .
<http://genome.db/coords/1:35721_36081> gtf:seqname "1" .
<http://genome.db/coords/1:35721_36081> gtf:start 35721 .
<http://genome.db/coords/1:35721_36081> gtf:stop 36081 .
<http://genome.db/coords/1:35721_36081> gtf:strand "-" .
<http://genome.db/coords/1:35721_36081> gtf:exon_number 3 .
<http://genome.db/ensembl/ENST00000461467> rdf:type ns0:SO_0000833 .
<http://genome.db/ensembl/ENST00000461467> rdfs:label "ENST00000461467" .
<http://genome.db/ensembl/ENST00000461467> gtf:parent_gene <http://genome.db/ensembl/ENSG00000237613> .
<http://genome.db/ensembl/ENST00000461467> gtf:uuid gtf:uuid-2cb16d8c-eac3-4524-8aaf-f78e13602c57 .
gtf:uuid-2cb16d8c-eac3-4524-8aaf-f78e13602c57 ngs:sample <http://genome.db/sample/SQ_0081> .
gtf:uuid-2cb16d8c-eac3-4524-8aaf-f78e13602c57 ngs:project <http://genome.db/project/Naive_T0> .
gtf:uuid-2cb16d8c-eac3-4524-8aaf-f78e13602c57 ngs:run <http://genome.db/run/110908_H125_0119_AB01W2ABXX> .
<http://genome.db/ensembl/ENST00000461467> gtf:location <http://genome.db/coords/1:35245_35481-35721_36073:r> .
<http://genome.db/coords/1:35245_35481-35721_36073:r> a gtf:cufflinks_transcript .
<http://genome.db/coords/1:35245_35481-35721_36073:r> gtf:seqname "1" .
<http://genome.db/coords/1:35245_35481-35721_36073:r> gtf:start 35245 .
<http://genome.db/coords/1:35245_35481-35721_36073:r> gtf:stop 36073 .
<http://genome.db/coords/1:35245_35481-35721_36073:r> gtf:strand "-" .
gtf:uuid-2cb16d8c-eac3-4524-8aaf-f78e13602c57 gtf:FPKM 4.5 .
gtf:uuid-2cb16d8c-eac3-4524-8aaf-f78e13602c57 gtf:frac 0.0 .
gtf:uuid-2cb16d8c-eac3-4524-8aaf-f78e13602c57 gtf:conf_lo 0.0 .
gtf:uuid-2cb16d8c-eac3-4524-8aaf-f78e13602c57 gtf:conf_hi 0.0 .
gtf:uuid-2cb16d8c-eac3-4524-8aaf-f78e13602c57 gtf:cov 0.0 .
gtf:uuid-2cb16d8c-eac3-4524-8aaf-f78e13602c57 gtf:full_read_support "no" .
gtf:uuid-01197dff-a1d8-4e61-8202-c27d8047986b rdf:type ns0:SO_0000852 .
gtf:uuid-01197dff-a1d8-4e61-8202-c27d8047986b gtf:parent_transcript gtf:uuid-2cb16d8c-eac3-4524-8aaf-f78e13602c57 .
gtf:uuid-01197dff-a1d8-4e61-8202-c27d8047986b gtf:location <http://genome.db/coords/1:35245_35481> .
<http://genome.db/coords/1:35245_35481> a gtf:caffulinks_exon .
<http://genome.db/coords/1:35245_35481> gtf:seqname "1" .
<http://genome.db/coords/1:35245_35481> gtf:start 35245 .
<http://genome.db/coords/1:35245_35481> gtf:stop 35481 .
<http://genome.db/coords/1:35245_35481> gtf:strand "-" .
<http://genome.db/coords/1:35245_35481> gtf:exon_number 1 .
This converter is implemented as part of https://github.com/helios/bioruby-ngs, but gem install fails, possibly because the samtools version it specifies is outdated.