SPARQLthon28/Cufflinks2RDF

From TogoWiki

Purpose

For the integrated microbial database MicrobeDB.jp, create an RDF data model for converting the Cufflinks output produced by the large-scale RNA-Seq analysis data flow.

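Cufflinks input files

  • BAM
  • GTF (converted from Genome-RDF)
    • seqid is matched to the FASTA header lines checked at SPARQLthon27
    • feature type: for microbial genomes each CDS is treated as an exon (CDS = exon), and start_codon and stop_codon features are added mechanically
    • gene_id uses the locus_tag; transcript_id is "transcript_" + gene_id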
refseq:NC_000913.3	RefSeq	exon	190	255	.	+	.	gene_id "b0001"; transcript_id "transcript_b0001";
refseq:NC_000913.3	RefSeq	CDS	190	252	.	+	.	gene_id "b0001"; transcript_id "transcript_b0001";
refseq:NC_000913.3	RefSeq	start_codon	253	255	.	+	.	gene_id "b0001"; transcript_id "transcript_b0001";
refseq:NC_000913.3	RefSeq	stop_codon	190	192	.	+	.	gene_id "b0001"; transcript_id "transcript_b0001";
refseq:NC_000913.3	RefSeq	exon	337	2799	.	+	.	gene_id "b0002"; transcript_id "transcript_b0002";
refseq:NC_000913.3	RefSeq	CDS	337	2796	.	+	.	gene_id "b0002"; transcript_id "transcript_b0002";
refseq:NC_000913.3	RefSeq	start_codon	2797	2799	.	+	.	gene_id "b0002"; transcript_id "transcript_b0002";
refseq:NC_000913.3	RefSeq	stop_codon	337	339	.	+	.	gene_id "b0002"; transcript_id "transcript_b0002";
refseq:NC_000913.3	RefSeq	exon	2801	3733	.	+	.	gene_id "b0003"; transcript_id "transcript_b0003";
refseq:NC_000913.3	RefSeq	CDS	2801	3730	.	+	.	gene_id "b0003"; transcript_id "transcript_b0003";
refseq:NC_000913.3	RefSeq	start_codon	3731	3733	.	+	.	gene_id "b0003"; transcript_id "transcript_b0003";
refseq:NC_000913.3	RefSeq	stop_codon	2801	2803	.	+	.	gene_id "b0003"; transcript_id "transcript_b0003";
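
The GTF generation rules listed above could be sketched roughly as follows. This is not the converter actually used here: the function name gene_to_gtf and its arguments are hypothetical. Codon placement follows the GTF 2.2 convention, which is an assumption of this sketch and may differ from the actual converter: start_codon is the first three bases of a plus-strand CDS, and the CDS excludes the stop codon.

```python
def gene_to_gtf(seqid, locus_tag, start, end, strand="+", source="RefSeq"):
    """Return GTF lines (exon, CDS, start_codon, stop_codon) for one CDS feature.

    Coordinates are 1-based and inclusive; the start..end range includes the
    stop codon. Hypothetical helper, not part of any existing tool.
    """
    gene_id = locus_tag                       # gene_id is the locus_tag
    transcript_id = "transcript_" + gene_id   # transcript_id rule from the notes
    attrs = f'gene_id "{gene_id}"; transcript_id "{transcript_id}";'

    if strand == "+":
        cds = (start, end - 3)                # CDS excludes the stop codon
        start_codon = (start, start + 2)
        stop_codon = (end - 2, end)
    else:
        cds = (start + 3, end)
        start_codon = (end - 2, end)
        stop_codon = (start, start + 2)

    features = [
        ("exon", (start, end)),               # CDS = exon for microbial genomes
        ("CDS", cds),
        ("start_codon", start_codon),
        ("stop_codon", stop_codon),
    ]
    return [
        "\t".join([seqid, source, ftype, str(s), str(e), ".", strand, ".", attrs])
        for ftype, (s, e) in features
    ]


lines = gene_to_gtf("refseq:NC_000913.3", "b0001", 190, 255)
for line in lines:
    print(line)
```

Only the exon lines matter for the FPKM calculation shown later on this page; the codon features are carried along for completeness.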

Cufflinks output files

Results of mapping reads to the E. coli K-12 MG1655 genome with Bowtie2, converting the alignments to BAM with SAMtools, and running Cufflinks with the GTF file to compute FPKM.

  • genes.fpkm_tracking
tracking_id	class_code	nearest_ref_id	gene_id	gene_short_name	tss_id	locus	length	coverage	FPKM	FPKM_conf_lo	FPKM_conf_hi	FPKM_status
b0001	-	-	b0001	-	-	refseq:NC_000913.3:189-255	-	-	2178.07	531.605	3824.54	OK
b0002	-	-	b0002	-	-	refseq:NC_000913.3:336-2799	-	-	29.5047	26.376	32.6333	OK
b0003	-	-	b0003	-	-	refseq:NC_000913.3:2800-3733	-	-	36.1379	30.0939	42.1819	OK
b0004	-	-	b0004	-	-	refseq:NC_000913.3:3733-5020	-	-	31.5031	26.8051	36.2011	OK
b0005	-	-	b0005	-	-	refseq:NC_000913.3:5233-5530	-	-	33.4489	17.6705	49.2272	OK
  • transcripts.gtf
refseq:NC_000913.3	Cufflinks	transcript	190	255	1000	+	.	gene_id "b0001"; transcript_id "transcript_b0001"; FPKM "2178.0736521866"; frac "1.000000"; conf_lo "531.604732"; conf_hi "3824.542572"; cov "501.695154";
refseq:NC_000913.3	Cufflinks	exon	190	255	1000	+	.	gene_id "b0001"; transcript_id "transcript_b0001"; exon_number "1"; FPKM "2178.0736521866"; frac "1.000000"; conf_lo "531.604732"; conf_hi "3824.542572"; cov "501.695154";
refseq:NC_000913.3	Cufflinks	transcript	337	2799	1000	+	.	gene_id "b0002"; transcript_id "transcript_b0002"; FPKM "29.5046541329"; frac "1.000000"; conf_lo "26.375989"; conf_hi "32.633319"; cov "11.383029";
refseq:NC_000913.3	Cufflinks	exon	337	2799	1000	+	.	gene_id "b0002"; transcript_id "transcript_b0002"; exon_number "1"; FPKM "29.5046541329"; frac "1.000000"; conf_lo "26.375989"; conf_hi "32.633319"; cov "11.383029";
refseq:NC_000913.3	Cufflinks	transcript	2801	3733	1000	+	.	gene_id "b0003"; transcript_id "transcript_b0003"; FPKM "36.1378547916"; frac "1.000000"; conf_lo "30.093856"; conf_hi "42.181853"; cov "13.894875";
refseq:NC_000913.3	Cufflinks	exon	2801	3733	1000	+	.	gene_id "b0003"; transcript_id "transcript_b0003"; exon_number "1"; FPKM "36.1378547916"; frac "1.000000"; conf_lo "30.093856"; conf_hi "42.181853"; cov "13.894875";
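
As a minimal sketch of reading genes.fpkm_tracking, the tab-separated file can be parsed with the standard csv module. The function name read_fpkm_tracking and the output keys are hypothetical. The sample rows are the first two rows of the listing above. In those listings the locus start (189) is one less than the corresponding GTF start (190), so the sketch keeps the value as written and does not reinterpret it.

```python
import csv
import io


def read_fpkm_tracking(handle):
    """Yield one dict per gene from a Cufflinks genes.fpkm_tracking file."""
    for row in csv.DictReader(handle, delimiter="\t"):
        # locus looks like "refseq:NC_000913.3:189-255"; the seqid itself
        # contains a colon, so split on the last colon only.
        seqid, span = row["locus"].rsplit(":", 1)
        locus_start, locus_end = (int(x) for x in span.split("-"))
        yield {
            "gene_id": row["gene_id"],
            "seqid": seqid,
            "locus_start": locus_start,
            "locus_end": locus_end,
            "fpkm": float(row["FPKM"]),
            "conf_lo": float(row["FPKM_conf_lo"]),
            "conf_hi": float(row["FPKM_conf_hi"]),
            "status": row["FPKM_status"],
        }


HEADER = ["tracking_id", "class_code", "nearest_ref_id", "gene_id",
          "gene_short_name", "tss_id", "locus", "length", "coverage",
          "FPKM", "FPKM_conf_lo", "FPKM_conf_hi", "FPKM_status"]
ROWS = [
    ["b0001", "-", "-", "b0001", "-", "-", "refseq:NC_000913.3:189-255",
     "-", "-", "2178.07", "531.605", "3824.54", "OK"],
    ["b0002", "-", "-", "b0002", "-", "-", "refseq:NC_000913.3:336-2799",
     "-", "-", "29.5047", "26.376", "32.6333", "OK"],
]
sample = "\n".join("\t".join(r) for r in [HEADER] + ROWS) + "\n"

records = list(read_fpkm_tracking(io.StringIO(sample)))
for rec in records:
    print(rec["gene_id"], rec["fpkm"], rec["status"])
```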

Information to convert to RDF

  • gene_id (genes.fpkm_tracking)
  • FPKM (genes.fpkm_tracking)
  • FPKM_conf_lo (genes.fpkm_tracking)
  • FPKM_conf_hi (genes.fpkm_tracking)
  • FPKM_status (genes.fpkm_tracking)
  • Metadata
    • Reference genome sequence: bioproject=PRJNA57659, refseq=NC_000913.3
    • Mapping data: sra=SRA141679, srx=SRX474167
    • Analysis method: "Mapped with Bowtie2, converted to BAM with SAMtools, then FPKM computed by Cufflinks with the GTF file"
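
To make the list above concrete, one gene record plus its metadata could be serialized to Turtle roughly like this. No vocabulary has been decided on this page, so everything here is a placeholder: the http://example.org/cufflinks/ namespace, the ex: predicate names, the subject URI pattern, and the function names.

```python
EX = "http://example.org/cufflinks/"  # hypothetical namespace, not an agreed vocabulary


def literal(value, datatype=None):
    """Format a Turtle literal, escaping backslashes and double quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"' + (f"^^{datatype}" if datatype else "")


def fpkm_to_turtle(rec, meta):
    """Serialize one genes.fpkm_tracking row and its metadata as Turtle."""
    # Hypothetical subject pattern: one node per (experiment, gene) pair.
    subject = f"<{EX}{meta['srx']}/{rec['gene_id']}>"
    pairs = [
        ("ex:gene_id", literal(rec["gene_id"])),
        ("ex:FPKM", literal(rec["FPKM"], "xsd:double")),
        ("ex:FPKM_conf_lo", literal(rec["FPKM_conf_lo"], "xsd:double")),
        ("ex:FPKM_conf_hi", literal(rec["FPKM_conf_hi"], "xsd:double")),
        ("ex:FPKM_status", literal(rec["FPKM_status"])),
        ("ex:bioproject", literal(meta["bioproject"])),
        ("ex:refseq", literal(meta["refseq"])),
        ("ex:sra", literal(meta["sra"])),
        ("ex:srx", literal(meta["srx"])),
    ]
    body = " ;\n".join(f"    {p} {o}" for p, o in pairs) + " ."
    header = [
        f"@prefix ex: <{EX}> .",
        "@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .",
        "",
    ]
    return "\n".join(header + [subject, body]) + "\n"


rec = {"gene_id": "b0001", "FPKM": "2178.07", "FPKM_conf_lo": "531.605",
       "FPKM_conf_hi": "3824.54", "FPKM_status": "OK"}
meta = {"bioproject": "PRJNA57659", "refseq": "NC_000913.3",
        "sra": "SRA141679", "srx": "SRX474167"}
turtle = fpkm_to_turtle(rec, meta)
print(turtle)
```

The numeric values are kept as the strings found in the file and typed as xsd:double, so no rounding is introduced by the conversion. In a real model the metadata would more likely point to resource URIs than to plain literals.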

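What we confirmed

  • The input GTF for Cufflinks was converted from Genome-RDF, so RDF for that information already exists; genes.fpkm_tracking is therefore used as the input for the conversion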

Survey

Cufflinks-rdf@bh12

A prototype was built at BioHackathon 2012: https://github.com/dbcls/bh12/wiki/Cufflinks-rdf

RDF model

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ns0: <http://purl.obolibrary.org/obo/> .
@prefix gtf: <http://genome.db/gtf/> .
@prefix ngs: <http://genome.db/ngs/> .
<http://genome.db/ensembl/ENST00000417324>  rdf:type    ns0:SO_0000833 .
<http://genome.db/ensembl/ENST00000417324>  rdfs:label  "ENST00000417324" .
<http://genome.db/ensembl/ENST00000417324>  gtf:parent_gene <http://genome.db/ensembl/ENSG00000237613> .
<http://genome.db/ensembl/ENST00000417324>  gtf:uuid    gtf:uuid-06555be2-3bca-40ff-84ac-ca641f88320e .
<http://genome.db/sample/SQ_0081>   ngs:hasSample  gtf:uuid-06555be2-3bca-40ff-84ac-ca641f88320e .
<http://genome.db/project/Naive_T0> ngs:hasProject gtf:uuid-06555be2-3bca-40ff-84ac-ca641f88320e .
<http://genome.db/run/110908_H125_0119_AB01W2ABXX> ngs:hasRun gtf:uuid-06555be2-3bca-40ff-84ac-ca641f88320e .
gtf:uuid-06555be2-3bca-40ff-84ac-ca641f88320e   ngs:sample  <http://genome.db/sample/SQ_0081> .
gtf:uuid-06555be2-3bca-40ff-84ac-ca641f88320e   ngs:project <http://genome.db/project/Naive_T0> .
gtf:uuid-06555be2-3bca-40ff-84ac-ca641f88320e   ngs:run <http://genome.db/run/110908_H125_0119_AB01W2ABXX> .
<http://genome.db/ensembl/ENST00000417324>  gtf:location    <http://genome.db/coords/1:34554_35174-35277_35481-35721_36081:r> .
<http://genome.db/coords/1:34554_35174-35277_35481-35721_36081:r> a   gtf:cufflinks_transcript .
<http://genome.db/coords/1:34554_35174-35277_35481-35721_36081:r> gtf:seqname "1" .
<http://genome.db/coords/1:34554_35174-35277_35481-35721_36081:r> gtf:start   34554 .
<http://genome.db/coords/1:34554_35174-35277_35481-35721_36081:r> gtf:stop    36081 .
<http://genome.db/coords/1:34554_35174-35277_35481-35721_36081:r> gtf:strand  "-" .
gtf:uuid-06555be2-3bca-40ff-84ac-ca641f88320e   gtf:FPKM    1.5 .
gtf:uuid-06555be2-3bca-40ff-84ac-ca641f88320e   gtf:frac    0.0 .
gtf:uuid-06555be2-3bca-40ff-84ac-ca641f88320e   gtf:conf_lo 0.0 .
gtf:uuid-06555be2-3bca-40ff-84ac-ca641f88320e   gtf:conf_hi 0.0 .
gtf:uuid-06555be2-3bca-40ff-84ac-ca641f88320e   gtf:cov 0.0 .
gtf:uuid-06555be2-3bca-40ff-84ac-ca641f88320e   gtf:full_read_support   "no" .
gtf:uuid-6c679944-1cb0-4891-905a-fbbf507c2ec9   rdf:type    ns0:SO_0000852 .
gtf:uuid-6c679944-1cb0-4891-905a-fbbf507c2ec9   gtf:parent_transcript   gtf:uuid-06555be2-3bca-40ff-84ac-ca641f88320e .
gtf:uuid-6c679944-1cb0-4891-905a-fbbf507c2ec9   gtf:location    <http://genome.db/coords/1:34554_35174> .
<http://genome.db/coords/1:34554_35174> a   gtf:caffulinks_exon .
<http://genome.db/coords/1:34554_35174> gtf:seqname "1" .
<http://genome.db/coords/1:34554_35174> gtf:start   34554 .
<http://genome.db/coords/1:34554_35174> gtf:stop    35174 .
<http://genome.db/coords/1:34554_35174> gtf:strand  "-" .
<http://genome.db/coords/1:34554_35174> gtf:exon_number 1 .
gtf:uuid-74fdb480-0b2a-4439-8050-f8ea508b30bf   rdf:type    ns0:SO_0000852 .
gtf:uuid-74fdb480-0b2a-4439-8050-f8ea508b30bf   gtf:parent_transcript   gtf:uuid-06555be2-3bca-40ff-84ac-ca641f88320e .
gtf:uuid-74fdb480-0b2a-4439-8050-f8ea508b30bf   gtf:location    <http://genome.db/coords/1:35277_35481> .
<http://genome.db/coords/1:35277_35481> a   gtf:caffulinks_exon .
<http://genome.db/coords/1:35277_35481> gtf:seqname "1" .
<http://genome.db/coords/1:35277_35481> gtf:start   35277 .
<http://genome.db/coords/1:35277_35481> gtf:stop    35481 .
<http://genome.db/coords/1:35277_35481> gtf:strand  "-" .
<http://genome.db/coords/1:35277_35481> gtf:exon_number 2 .
gtf:uuid-62592261-26e6-4ac5-81b0-b5c4e73d87ad   rdf:type    ns0:SO_0000852 .
gtf:uuid-62592261-26e6-4ac5-81b0-b5c4e73d87ad   gtf:parent_transcript   gtf:uuid-06555be2-3bca-40ff-84ac-ca641f88320e .
gtf:uuid-62592261-26e6-4ac5-81b0-b5c4e73d87ad   gtf:location    <http://genome.db/coords/1:35721_36081> .
<http://genome.db/coords/1:35721_36081> a   gtf:caffulinks_exon .
<http://genome.db/coords/1:35721_36081> gtf:seqname "1" .
<http://genome.db/coords/1:35721_36081> gtf:start   35721 .
<http://genome.db/coords/1:35721_36081> gtf:stop    36081 .
<http://genome.db/coords/1:35721_36081> gtf:strand  "-" .
<http://genome.db/coords/1:35721_36081> gtf:exon_number 3 .
<http://genome.db/ensembl/ENST00000461467>  rdf:type    ns0:SO_0000833 .
<http://genome.db/ensembl/ENST00000461467>  rdfs:label  "ENST00000461467" .
<http://genome.db/ensembl/ENST00000461467>  gtf:parent_gene <http://genome.db/ensembl/ENSG00000237613> .
<http://genome.db/ensembl/ENST00000461467>  gtf:uuid    gtf:uuid-2cb16d8c-eac3-4524-8aaf-f78e13602c57 .
gtf:uuid-2cb16d8c-eac3-4524-8aaf-f78e13602c57   ngs:sample  <http://genome.db/sample/SQ_0081> .
gtf:uuid-2cb16d8c-eac3-4524-8aaf-f78e13602c57   ngs:project <http://genome.db/project/Naive_T0> .
gtf:uuid-2cb16d8c-eac3-4524-8aaf-f78e13602c57   ngs:run <http://genome.db/run/110908_H125_0119_AB01W2ABXX> .
<http://genome.db/ensembl/ENST00000461467>  gtf:location    <http://genome.db/coords/1:35245_35481-35721_36073:r> .
<http://genome.db/coords/1:35245_35481-35721_36073:r>   a   gtf:cufflinks_transcript .
<http://genome.db/coords/1:35245_35481-35721_36073:r>   gtf:seqname "1" .
<http://genome.db/coords/1:35245_35481-35721_36073:r>   gtf:start   35245 .
<http://genome.db/coords/1:35245_35481-35721_36073:r>   gtf:stop    36073 .
<http://genome.db/coords/1:35245_35481-35721_36073:r>   gtf:strand  "-" .
gtf:uuid-2cb16d8c-eac3-4524-8aaf-f78e13602c57   gtf:FPKM    4.5 .
gtf:uuid-2cb16d8c-eac3-4524-8aaf-f78e13602c57   gtf:frac    0.0 .
gtf:uuid-2cb16d8c-eac3-4524-8aaf-f78e13602c57   gtf:conf_lo 0.0 .
gtf:uuid-2cb16d8c-eac3-4524-8aaf-f78e13602c57   gtf:conf_hi 0.0 .
gtf:uuid-2cb16d8c-eac3-4524-8aaf-f78e13602c57   gtf:cov 0.0 .
gtf:uuid-2cb16d8c-eac3-4524-8aaf-f78e13602c57   gtf:full_read_support   "no" .
gtf:uuid-01197dff-a1d8-4e61-8202-c27d8047986b   rdf:type    ns0:SO_0000852 .
gtf:uuid-01197dff-a1d8-4e61-8202-c27d8047986b   gtf:parent_transcript   gtf:uuid-2cb16d8c-eac3-4524-8aaf-f78e13602c57 .
gtf:uuid-01197dff-a1d8-4e61-8202-c27d8047986b   gtf:location    <http://genome.db/coords/1:35245_35481> .
<http://genome.db/coords/1:35245_35481> a   gtf:caffulinks_exon .
<http://genome.db/coords/1:35245_35481> gtf:seqname "1" .
<http://genome.db/coords/1:35245_35481> gtf:start   35245 .
<http://genome.db/coords/1:35245_35481> gtf:stop    35481 .
<http://genome.db/coords/1:35245_35481> gtf:strand  "-" .
<http://genome.db/coords/1:35245_35481> gtf:exon_number 1 .

It is implemented as part of https://github.com/helios/bioruby-ngs, but gem install fails, possibly because the samtools version it specifies is outdated.
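
The transcript location URIs in the bh12 model appear to be built from the sequence name, the exon spans, and a strand suffix. The sketch below is inferred from the single minus-strand example in the listing and is not taken from the bioruby-ngs code. The function name is hypothetical, and the behavior for plus-strand transcripts (no suffix) is a guess.

```python
COORDS_BASE = "http://genome.db/coords/"


def transcript_location_uri(seqname, exons, strand):
    """Build a bh12-style location URI from (start, end) exon pairs.

    The pattern is inferred from the example listing only; plus-strand
    handling is an assumption.
    """
    spans = "-".join(f"{start}_{end}" for start, end in sorted(exons))
    suffix = ":r" if strand == "-" else ""
    return f"{COORDS_BASE}{seqname}:{spans}{suffix}"


uri = transcript_location_uri(
    "1", [(34554, 35174), (35277, 35481), (35721, 36081)], "-"
)
print(uri)
```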

References

GTF format