BH12.12/SPARQLthon/Generating and improving RDF genome data

The RDF genome currently loaded at fat:8890 is v2, but this version of the RDF still has a number of problems.

Link issues

In RefSeq, links are generally written as /db_xref="DBname:entryID", but it is not obvious how many distinct DB names exist or what URL each link should point to.
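One way to find out how many DB names actually occur is to scan the flat files and tally them. The following is a minimal sketch, assuming the qualifiers appear as plain text in the form /db_xref="DB:ID"; the sample text is illustrative only and `count_db_names` is a hypothetical helper, not part of any existing tool.

```python
import re
from collections import Counter

# Matches GenBank-style qualifiers such as /db_xref="taxon:511145".
# The DB name is everything up to the first colon.
DB_XREF = re.compile(r'/db_xref="([^":]+):([^"]+)"')


def count_db_names(flatfile_text):
    """Return a Counter of DB names seen in /db_xref qualifiers."""
    return Counter(db for db, _entry in DB_XREF.findall(flatfile_text))


# Illustrative sample in the style of a RefSeq feature table.
sample = '''
     source          1..4641652
                     /db_xref="taxon:511145"
     CDS             190..255
                     /db_xref="GeneID:944742"
                     /db_xref="ECOCYC:G7952"
                     /db_xref="GeneID:945803"
'''

counts = count_db_names(sample)
for db, n in counts.most_common():
    print(db, n)
```

Running this over all entries would give the full inventory of DB names that need a link target.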

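If /db_xref is converted to rdfs:seeAlso, which domain's URI should be used as the link target?
- xrefs appear in somewhat inconsistent places, e.g. /db_xref="taxon:###" versus /protein_id="###"; is seeAlso appropriate for something as important as taxon?
- Some, such as /db_xref="ERIC:###", cannot be linked because the target has no per-entry URL (see below)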

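Searching links

To tell which database a link target URI points to, there are two options:

Use a different predicate for each link type, or attach a class to the link target URI
- Following UniProt, assigning a class to each URI is probably the better choice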

prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
select *
from <http://genome.db/>
where {
 ?organism rdfs:label ?name .
 ?organism rdfs:seeAlso ?url .
 # String match against every link URI
 FILTER regex(str(?url), "taxonomy")
}

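Queries like this are slow, so we would rather fetch the links like this:

prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix ds: <http://identifiers.org/dataset/>
select *
from <http://genome.db/>
where {
 ?organism rdfs:label ?name .
 ?organism rdfs:seeAlso ?url .
 ?url rdf:type ds:Taxonomy
}

(Naturally, the source data then becomes ridiculously large → do we want a triple store and a SPARQL syntax that can do fast URL prefix matching?)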
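For the class-based query to work, every link has to be emitted together with a type triple for its target. A minimal sketch of that pairing, written as N-Triples, is shown here; the subject URI is a placeholder and `typed_link_triples` is a hypothetical helper, not the actual converter.

```python
RDFS_SEE_ALSO = "http://www.w3.org/2000/01/rdf-schema#seeAlso"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
DATASET_NS = "http://identifiers.org/dataset/"


def typed_link_triples(subject_uri, target_uri, dataset_name):
    """Return two N-Triples lines: the seeAlso link and the class of its target."""
    return [
        f"<{subject_uri}> <{RDFS_SEE_ALSO}> <{target_uri}> .",
        f"<{target_uri}> <{RDF_TYPE}> <{DATASET_NS}{dataset_name}> .",
    ]


# The subject URI is a placeholder, not one used by the RDF genome.
lines = typed_link_triples(
    "http://example.org/organism/1",
    "http://identifiers.org/taxonomy/511145",
    "Taxonomy",
)
print("\n".join(lines))
```

The extra type triple per link target is the cost mentioned above: it avoids the regex filter but adds to the data size.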

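Generating link data

From RDF genome v4 onward, link targets use Identifiers.org URIs wherever possible. (For reference, v3 used urn:uuid:###, which was dropped because UUIDs have the same resolver problem as LSIDs.) To state explicitly which database each link target URI belongs to, we generate RDF from the MIRIAM XML dump that declares each database as rdf:type void:Dataset.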

http://www.ebi.ac.uk/miriam/main/collections/

miriam_xml2rdf.rb https://gist.github.com/3985701
- A modified version of miriam_xml2csv.rb https://gist.github.com/1672112, which was written earlier for http://semantic.togodb.dbcls.jp/togodb/view/miriam
- Generated data: https://dl.dropbox.com/u/429992/20121031-miriam.ttl

@prefix : <http://identifiers.org/dataset/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix miriam: <http://www.biomodels.net/MIRIAM/> .

<>
  dcterms:date "2012-10-31T07:27:46+00:00" ;
  dcterms:hasVersion "2012-10-26T10:58:16+01:00" .

:Ensembl
  rdf:type void:Dataset ;
  rdfs:label "Ensembl" ;
  rdfs:comment "Ensembl is a joint project between EMBL - EBI and the Sanger Institute to develop a software system which produces and maintains automatic annotation on selected eukaryotic genomes." ;
  miriam:datatype "MIR:00000003" ;
  miriam:urn <urn:miriam:ensembl> ;
  miriam:url <http://identifiers.org/ensembl/> ;
  miriam:namespace "ensembl" .

    :

:Uniprot
  rdf:type void:Dataset ;
  rdfs:label "UniProt" ;
  rdfs:comment "UniProt (Universal Protein Resource) is the world's most comprehensive catalog of information on proteins. It is a central repository of protein sequence and function created by joining the information contained in Swiss-Prot, TrEMBL, and PIR." ;
  miriam:datatype "MIR:00000005" ;
  miriam:urn <urn:miriam:uniprot> ;
  miriam:url <http://identifiers.org/uniprot/> ;
  miriam:namespace "uniprot" .

:Taxonomy
  rdf:type void:Dataset ;
  rdfs:label "Taxonomy" ;
  rdfs:comment "The taxonomy contains the relationships between all living forms for which nucleic acid or protein sequence have been determined." ;
  miriam:datatype "MIR:00000006" ;
  miriam:urn <urn:miriam:taxonomy> ;
  miriam:url <http://identifiers.org/taxonomy/> ;
  miriam:namespace "taxonomy" .

    :
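The shape of each block above can be reproduced from a small record per database. The sketch below is not miriam_xml2rdf.rb; it assumes the MIRIAM XML has already been parsed into dictionaries, and the dictionary keys (`local_name`, `label`, `comment`, `datatype`, `namespace`) are hypothetical names chosen for illustration.

```python
def turtle_literal(text):
    """Quote a string as a Turtle literal, escaping backslashes, quotes and newlines."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return '"' + escaped + '"'


def dataset_to_turtle(rec):
    """Render one record as a void:Dataset block in the layout shown above."""
    ns = rec["namespace"]
    return "\n".join([
        f":{rec['local_name']}",
        "  rdf:type void:Dataset ;",
        f"  rdfs:label {turtle_literal(rec['label'])} ;",
        f"  rdfs:comment {turtle_literal(rec['comment'])} ;",
        f"  miriam:datatype {turtle_literal(rec['datatype'])} ;",
        f"  miriam:urn <urn:miriam:{ns}> ;",
        f"  miriam:url <http://identifiers.org/{ns}/> ;",
        f"  miriam:namespace {turtle_literal(ns)} .",
    ])


# Values copied from the :Taxonomy block above (comment shortened).
taxonomy = {
    "local_name": "Taxonomy",
    "label": "Taxonomy",
    "comment": "The taxonomy contains the relationships between all living forms.",
    "datatype": "MIR:00000006",
    "namespace": "taxonomy",
}
block = dataset_to_turtle(taxonomy)
print(block)
```

The @prefix header still has to be written once at the top of the output file.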

All /db_xref values found in RefSeq prokaryote genome entries

The lists below, rendered in a JSON-like form:

https://dl.dropbox.com/u/429992/20121031-miriam.json

DBs that have a link target:

ASAP -- http://identifiers.org/asap/
ATCC -- http://identifiers.org/atcc/
CDD -- http://identifiers.org/cdd/
EcoGene -- http://identifiers.org/ecogene/
GO -- http://identifiers.org/obo.go/
GOA -- http://identifiers.org/goa/
GeneID -- http://identifiers.org/ncbigene/
Greengenes -- http://identifiers.org/greengenes/
HSSP -- http://identifiers.org/hssp/
ISFinder -- http://identifiers.org/isfinder/
InterPro -- http://identifiers.org/interpro/
NBRC -- http://identifiers.org/nbrc/
PDB -- http://identifiers.org/pdb/
PFAM -- http://identifiers.org/pfam/
REBASE -- http://identifiers.org/rebase/
TIGRFAM -- http://identifiers.org/tigrfam/
UniProtKB/Swiss-Prot -- http://identifiers.org/uniprot/
UniProtKB/TrEMBL -- http://identifiers.org/uniprot/
taxon -- http://identifiers.org/taxonomy/

DBs whose links need care:

HOMD -- http://identifiers.org/homd.taxon/
- Strip tax_ from values such as /db_xref="HOMD:tax_721" before linking
ECOCYC -- http://identifiers.org/biocyc/ECOCYC:
- ECOCYC is not registered on its own at identifiers.org, so link via BIOCYC with ECOCYC: prepended
- http://identifiers.org/biocyc/ECOCYC:G7952
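The mapping and the two special cases can be put together in one lookup. This is a minimal sketch using a subset of the prefixes listed above; `xref_to_uri` is a hypothetical helper, and returning None for unknown DBs is an assumption about how unlinkable entries might be handled.

```python
# DB name -> Identifiers.org URI prefix, copied from the lists above (subset).
PREFIXES = {
    "GeneID": "http://identifiers.org/ncbigene/",
    "PDB": "http://identifiers.org/pdb/",
    "taxon": "http://identifiers.org/taxonomy/",
    "HOMD": "http://identifiers.org/homd.taxon/",
    # ECOCYC is reached through BIOCYC, keeping the "ECOCYC:" prefix.
    "ECOCYC": "http://identifiers.org/biocyc/ECOCYC:",
}


def xref_to_uri(db_xref):
    """Map a 'DB:ID' string to an Identifiers.org URI, or None if no target is known."""
    db, sep, entry = db_xref.partition(":")
    if not sep or db not in PREFIXES:
        return None
    if db == "HOMD" and entry.startswith("tax_"):
        # HOMD:tax_721 -> 721
        entry = entry[len("tax_"):]
    return PREFIXES[db] + entry


for xref in ["taxon:511145", "HOMD:tax_721", "ECOCYC:G7952", "ERIC:123"]:
    print(xref, "->", xref_to_uri(xref))
```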

DBs with no link target at Identifiers.org:

ERIC
- Enteropathogen Resource Integration Center
- http://www.ericbrc.org/portal/eric/
GI
- NCBI GI mixes protein and nucleotide IDs, so the link target may have to depend on where in the source data the GI appeared?
- http://www.ncbi.nlm.nih.gov/Sitemap/sequenceIDs.html
HMP
- Human Microbiome Project
- http://www.hmpdacc.org/
- http://www.hmpdacc-resources.org/hmp_catalog/main.cgi?section=HmpSummary&page=displayHmpProject&hmp_id=0536
PSEUDO
- EMBL pseudo protein identifier
Pathema
- Pathema Genome Resource
- http://pathema.jcvi.org/
PseudoCap
- Pseudomonas Genome Database
- http://www.pseudomonas.com/
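For the GI case, one possible approach is to pick the target from the feature in which the xref occurred. The sketch below rests on two assumptions that should be checked against the data: that a GI inside a CDS feature refers to the protein and any other GI to the nucleotide record, and that the two NCBI URL patterns used here are the desired targets. `gi_link` is a hypothetical helper.

```python
# Assumed NCBI URL patterns; verify before relying on them.
GI_TARGETS = {
    "protein": "http://www.ncbi.nlm.nih.gov/protein/",
    "nucleotide": "http://www.ncbi.nlm.nih.gov/nuccore/",
}


def gi_link(gi, feature_key):
    """Choose a link target for a GI based on the feature it appeared in.

    Assumption: a GI inside a CDS feature is a protein GI; anything else
    is treated as a nucleotide GI.
    """
    kind = "protein" if feature_key == "CDS" else "nucleotide"
    return GI_TARGETS[kind] + gi


print(gi_link("12345", "CDS"))     # placeholder GI in a CDS feature
print(gi_link("67890", "source"))  # placeholder GI elsewhere
```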

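It is hard to find out what these DB names refer to, but the DDBJ link table (an Excel file provided by Fujisawa-san) settled it. They may simply not be registered at Identifiers.org yet, for reasons such as the DB not supporting per-entry links.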

The problem of grouping sequence sets by species

Bind them by taxid?
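If taxid is used as the key, grouping is a straightforward bucket operation. A minimal sketch, assuming each sequence entry has already been paired with the taxid from its source feature; the sequence IDs and taxids below are placeholders.

```python
from collections import defaultdict


def group_by_taxid(entries):
    """Group sequence IDs by taxid, preserving input order within each group."""
    groups = defaultdict(list)
    for seq_id, taxid in entries:
        groups[taxid].append(seq_id)
    return dict(groups)


# Placeholder data: e.g. a chromosome and a plasmid sharing one taxid.
entries = [("seq_A", "100"), ("seq_B", "100"), ("seq_C", "200")]
groups = group_by_taxid(entries)
for taxid, seq_ids in groups.items():
    print(taxid, seq_ids)
```

Whether taxid is fine-grained enough (strain versus species level) remains the open question.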