SPARQLthon46-hands-on-workshop

From TogoWiki

Installing Fuseki

Install Fuseki by following one of these articles:
- Downloading and starting Fuseki: Mac OS X edition
- Downloading and starting Fuseki: Windows 10 edition

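Trying SPARQL: DBpedia

DBpedia is a community project that extracts structured data from Wikipedia and republishes it as Linked Open Data. It mainly covers Wikipedia infoboxes, link relationships between pages, and similar data.
- A sister project, DBpedia Japanese, extracts structured data from the Japanese Wikipedia and publishes it as LOD in the same way.

Try the sample from the Qiita article "Apache Jena FusekiにデータをロードしてSPARQLを試す" (Loading data into Apache Jena Fuseki and trying SPARQL).
- This example fetches RDF data with the LOAD operation, which is part of SPARQL Update (the specification for queries that modify data) introduced in SPARQL 1.1.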
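
As a rough illustration of the LOAD operation mentioned above, the following Python sketch builds (but does not send) a SPARQL Update request in the form-encoded style of the SPARQL 1.1 Protocol. The Fuseki URL http://localhost:3030/ds/update, the dataset name ds, the source URL http://example.org/data.rdf, and the helper name build_load_request are all assumptions made for this sketch; they are not taken from the Qiita article.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_load_request(update_endpoint, source_url, graph_iri=None):
    """Build a POST request carrying a SPARQL Update LOAD operation.

    The request is only constructed here; nothing is sent over the network.
    """
    update = f"LOAD <{source_url}>"
    if graph_iri is not None:
        # Optionally load into a named graph instead of the default graph.
        update += f" INTO GRAPH <{graph_iri}>"
    # SPARQL 1.1 Protocol: update text in the form parameter "update".
    data = urlencode({"update": update}).encode("utf-8")
    return Request(
        update_endpoint,
        data=data,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


# Hypothetical local Fuseki dataset "ds" and a placeholder RDF document.
request = build_load_request(
    "http://localhost:3030/ds/update",
    "http://example.org/data.rdf",
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the request to urllib.request.urlopen would actually send it; that step is left out so the sketch can be read and run without a server.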

Then try the same queries against the DBpedia SPARQL endpoints.
- DBpedia SPARQL endpoint
- DBpedia Japanese SPARQL endpoint

Trying SPARQL: UniProt

Query the UniProt RDF with SPARQL.
- Depending on your environment, either run the queries against the UniProt SPARQL endpoint, or download a reasonably sized subset of the UniProt RDF, load it into Fuseki, and run the queries locally.
  - UniProt SPARQL endpoint
  - UniProt Proteomes, Escherichia coli K-12 RDF (3,566,406 triples): RDF/XML, RDF/XML (gzipped), Turtle
    - Because the file is large, the upload may stall if the Fuseki dataset type is In-memory. Create the Fuseki dataset as Persistent instead.
  - UniProt RDF FTP (EBI mirror)
    - go.owl (uncompressed)

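Example 1

Retrieve 25 instances of <http://purl.uniprot.org/core/Protein> (that is, 25 UniProt entries).
- Declaring the UniProt ontology namespace <http://purl.uniprot.org/core/> with PREFIX is convenient, because the class can then be written as core:Protein.

SPARQL for the local environment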

 PREFIX core: <http://purl.uniprot.org/core/>
 
 SELECT ?protein
 WHERE {
   ?protein a core:Protein
 }
 LIMIT 25

SPARQL for the endpoint

 PREFIX core: <http://purl.uniprot.org/core/>
 
 SELECT ?protein
 WHERE {
   ?protein a core:Protein .
   ?protein core:organism <http://purl.uniprot.org/taxonomy/83333> .
 }
 LIMIT 25

Example 2: COUNT

Count the instances of <http://purl.uniprot.org/core/Protein> (that is, the UniProt entries) for E. coli strain K-12.

SPARQL for the local environment (assuming only the E. coli RDF is loaded locally)

 PREFIX core: <http://purl.uniprot.org/core/>
 
 SELECT (COUNT(?subject) AS ?EntryNumber)
 WHERE {
   ?subject a core:Protein
 }

SPARQL for the endpoint
- The UniProt SPARQL endpoint holds all of UniProt, not just E. coli, so a triple pattern that restricts the organism must be added (the same applies to the examples below).

 PREFIX core: <http://purl.uniprot.org/core/>
 
 SELECT (COUNT(?protein) AS ?EntryNumber)
 WHERE {
   ?protein a core:Protein .
   ?protein core:organism <http://purl.uniprot.org/taxonomy/83333> .
 }
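
The local/endpoint difference above can be sketched in Python as a single query builder that adds the organism pattern only for the public endpoint, plus a helper that wraps the query in a SPARQL Protocol GET request. This is a minimal sketch: the URLs http://localhost:3030/uniprot/query (a hypothetical Fuseki dataset named uniprot) and https://sparql.uniprot.org/sparql (assumed address of the public endpoint), and the helper names count_query and build_request, are assumptions introduced here.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed endpoint URLs; adjust them to your own setup.
LOCAL_ENDPOINT = "http://localhost:3030/uniprot/query"
UNIPROT_ENDPOINT = "https://sparql.uniprot.org/sparql"

# Restricts results to E. coli K-12 (taxonomy ID 83333, as in the queries above).
ORGANISM_PATTERN = "?protein core:organism <http://purl.uniprot.org/taxonomy/83333> ."


def count_query(restrict_to_ecoli):
    """Return the Example 2 COUNT query, with or without the organism pattern."""
    patterns = ["?protein a core:Protein ."]
    if restrict_to_ecoli:
        patterns.append(ORGANISM_PATTERN)
    body = "\n".join("  " + pattern for pattern in patterns)
    return (
        "PREFIX core: <http://purl.uniprot.org/core/>\n"
        "SELECT (COUNT(?protein) AS ?EntryNumber)\n"
        "WHERE {\n" + body + "\n}"
    )


def build_request(endpoint, query):
    """Build (without sending) a GET request with the query in the 'query' parameter."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


local_request = build_request(LOCAL_ENDPOINT, count_query(False))
remote_request = build_request(UNIPROT_ENDPOINT, count_query(True))
print(count_query(True))
```

Sending either request with urllib.request.urlopen and parsing the JSON body would return the count; that part is omitted so the sketch runs without network access.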

Example 3

Retrieve the proteins annotated with the Gene Ontology (GO) term GO_0003700 "nucleic acid binding transcription factor activity" or any of its descendant terms.

SPARQL for the local environment

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX core: <http://purl.uniprot.org/core/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?protein ?go
WHERE {
  ?protein a core:Protein .
  ?protein core:classifiedWith ?go .
  ?go rdfs:subClassOf* ?go2 .
  ?go2 rdfs:label "nucleic acid binding transcription factor activity" .
}

SPARQL for the endpoint

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX core: <http://purl.uniprot.org/core/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?protein ?go
WHERE {
  ?protein a core:Protein .
  ?protein core:organism <http://purl.uniprot.org/taxonomy/83333> .
  ?protein core:classifiedWith ?go .
  ?go rdfs:subClassOf* ?go2 .
  ?go2 rdfs:label "nucleic acid binding transcription factor activity" .
}

Example 4: Grouping with GROUP BY

Retrieve a list, without duplicates, of the proteins annotated with the Gene Ontology (GO) term GO_0003677 "DNA binding" or any of its descendant terms.

SPARQL for the local environment

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX core: <http://purl.uniprot.org/core/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?protein
WHERE {
  ?protein a core:Protein .
  ?protein core:classifiedWith ?go .
  ?go rdfs:subClassOf* ?go2 .
  ?go2 rdfs:label "DNA binding" .
} GROUP BY ?protein

SPARQL for the endpoint

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX core: <http://purl.uniprot.org/core/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?protein
WHERE {
  ?protein a core:Protein .
  ?protein core:organism <http://purl.uniprot.org/taxonomy/83333> .
  ?protein core:classifiedWith ?go .
  ?go rdfs:subClassOf* ?go2 .
  ?go2 rdfs:label "DNA binding" .
} GROUP BY ?protein

Example 5: BIND, GROUP_CONCAT, STRAFTER

For each protein annotated with a term below the Gene Ontology (GO) term GO_0003677 "DNA binding", list the protein once together with the GO IDs it is annotated with.

SPARQL for the local environment

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX core: <http://purl.uniprot.org/core/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?protein (GROUP_CONCAT(DISTINCT ?goid; SEPARATOR=",") AS ?goids)
WHERE {
  ?protein a core:Protein .
  ?protein core:classifiedWith ?go .
  ?go rdfs:subClassOf+ ?go2 .
  ?go rdfs:label ?label .
  ?go2 rdfs:label "DNA binding" .
  BIND(STRAFTER(STR(?go), "obo/") AS ?goid)
} GROUP BY ?protein

SPARQL for the endpoint

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX core: <http://purl.uniprot.org/core/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?protein (GROUP_CONCAT(DISTINCT ?goid; SEPARATOR=",") AS ?goids)
WHERE {
  ?protein a core:Protein .
  ?protein core:organism <http://purl.uniprot.org/taxonomy/83333> .
  ?protein core:classifiedWith ?go .
  ?go rdfs:subClassOf+ ?go2 .
  ?go rdfs:label ?label .
  ?go2 rdfs:label "DNA binding" .
  BIND(STRAFTER(STR(?go), "obo/") AS ?goid)
} GROUP BY ?protein
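
To make the roles of STRAFTER and GROUP_CONCAT in Example 5 concrete, here is a small Python sketch that mimics them on in-memory rows. The functions strafter and group_concat are stand-ins written for this illustration, not part of any SPARQL library, and the two sample rows are chosen to mirror the duplicated GO_0003684 annotation of P04152 discussed in the Q&A section of this page. Note that SPARQL does not define the order of values inside GROUP_CONCAT; this sketch simply keeps input order.

```python
def strafter(text, marker):
    """Mimic SPARQL STRAFTER: the part of text after the first marker, or ""."""
    index = text.find(marker)
    return "" if index < 0 else text[index + len(marker):]


def group_concat(rows, distinct, separator=","):
    """Mimic GROUP BY ?protein with GROUP_CONCAT over the extracted GO IDs."""
    grouped = {}
    for protein, go_iri in rows:
        grouped.setdefault(protein, []).append(strafter(go_iri, "obo/"))
    result = {}
    for protein, goids in grouped.items():
        if distinct:
            # Drop repeated IDs while keeping first-seen order.
            goids = list(dict.fromkeys(goids))
        result[protein] = separator.join(goids)
    return result


# Two identical solutions for one protein, as an engine that reports
# one solution per path through the GO hierarchy might produce.
rows = [
    ("http://purl.uniprot.org/uniprot/P04152", "http://purl.obolibrary.org/obo/GO_0003684"),
    ("http://purl.uniprot.org/uniprot/P04152", "http://purl.obolibrary.org/obo/GO_0003684"),
]

print(group_concat(rows, distinct=False))
print(group_concat(rows, distinct=True))
```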

Example 6: CONCAT

For each protein annotated with a term below the Gene Ontology (GO) term "nucleotide binding", list the protein once together with the annotated GO IDs and their labels.

SPARQL for the local environment

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX core: <http://purl.uniprot.org/core/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?protein (GROUP_CONCAT(DISTINCT ?goid; SEPARATOR=", ") AS ?goids)
WHERE {
  ?protein a core:Protein .
  ?protein core:classifiedWith ?go .
  ?go rdfs:subClassOf+ ?go2 .
  ?go rdfs:label ?label .
  # Wrap the label in double quotes (CONCAT accepts any number of arguments).
  BIND(CONCAT('"', ?label, '"') AS ?quoted_label)
  ?go2 rdfs:label "nucleotide binding" .
  # Join the GO ID and the quoted label with a colon.
  BIND(CONCAT(STRAFTER(STR(?go), "/obo/"), ":", ?quoted_label) AS ?goid)
} GROUP BY ?protein

SPARQL for the endpoint

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX core: <http://purl.uniprot.org/core/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?protein (GROUP_CONCAT(DISTINCT ?goid; SEPARATOR=", ") AS ?goids)
WHERE {
  ?protein a core:Protein .
  ?protein core:organism <http://purl.uniprot.org/taxonomy/83333> .
  ?protein core:classifiedWith ?go .
  ?go rdfs:subClassOf+ ?go2 .
  ?go rdfs:label ?label .
  # Wrap the label in double quotes (CONCAT accepts any number of arguments).
  BIND(CONCAT('"', ?label, '"') AS ?quoted_label)
  ?go2 rdfs:label "nucleotide binding" .
  # Join the GO ID and the quoted label with a colon.
  BIND(CONCAT(STRAFTER(STR(?go), "/obo/"), ":", ?quoted_label) AS ?goid)
} GROUP BY ?protein

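Example 7

The same as Example 6, but for terms below "nucleic acid binding transcription factor activity" and without DISTINCT in GROUP_CONCAT, so any duplicate solutions show up in the output.

SPARQL for the local environment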

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX core: <http://purl.uniprot.org/core/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?protein (GROUP_CONCAT(?goid; SEPARATOR=", ") AS ?goids)
WHERE {
  ?protein a core:Protein .
  ?protein core:classifiedWith ?go .
  ?go rdfs:subClassOf+ ?go2 .
  ?go rdfs:label ?label .
  BIND(CONCAT('"', ?label, '"') AS ?quoted_label)
  ?go2 rdfs:label "nucleic acid binding transcription factor activity" .
  BIND(CONCAT(STRAFTER(STR(?go), "/obo/"), ":", ?quoted_label) AS ?goid)
} GROUP BY ?protein

SPARQL for the endpoint

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX core: <http://purl.uniprot.org/core/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?protein (GROUP_CONCAT(?goid; SEPARATOR=", ") AS ?goids)
WHERE {
  ?protein a core:Protein .
  ?protein core:organism <http://purl.uniprot.org/taxonomy/83333> .
  ?protein core:classifiedWith ?go .
  ?go rdfs:subClassOf+ ?go2 .
  ?go rdfs:label ?label .
  BIND(CONCAT('"', ?label, '"') AS ?quoted_label)
  ?go2 rdfs:label "nucleic acid binding transcription factor activity" .
  BIND(CONCAT(STRAFTER(STR(?go), "/obo/"), ":", ?quoted_label) AS ?goid)
} GROUP BY ?protein

Q&A

On the need for DISTINCT on GO IDs

Example 5 applies DISTINCT to ?goid to remove duplicates. Why does the same query on Fuseki return no duplicates even when DISTINCT is omitted?

Result on Fuseki with DISTINCT removed from Example 5 (excerpt)

<http://purl.uniprot.org/uniprot/P04152>    |   "GO_0003684"

Result on the UniProt endpoint (Virtuoso)

<http://purl.uniprot.org/uniprot/P04152>    |   GO_0003684,GO_0003684

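Cause

When a property path is used, Fuseki appears to return a given start/end pair only once even if it can be reached by several different paths, whereas Virtuoso appears to return one result per path, duplicates included. Section 9.3, "Property Paths and Equivalent Patterns", of the SPARQL 1.1 Query Language recommendation states that connectivity matching with * and + does not introduce duplicates, so Fuseki's behaviour is the one that matches the specification; the original notes credited UniProt here.
In this case the results are identical as long as DISTINCT is applied.
This kind of difference can arise with any ontology that, like GO, allows multiple inheritance, so it is worth keeping in mind.

Query that walks up the superclasses of GO:0003677 (DNA binding)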

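PREFIX obo: <http://purl.obolibrary.org/obo/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?parent_go ?parent_go_label
WHERE {
  # This named graph is the one used on the UniProt endpoint; for data
  # loaded into a local default graph, drop the GRAPH wrapper.
  GRAPH <http://sparql.uniprot.org/go/> {
    obo:GO_0003677 rdfs:subClassOf* ?parent_go .
    ?parent_go rdfs:label ?parent_go_label
  }
}
ORDER BY DESC(?parent_go)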

Hierarchy of GO:0003677 (DNA binding)

[Figure: hierarchy diagram] There are two routes up to GO:0005488 (binding) and GO:0003674 (molecular_function): one through "heterocyclic compound binding" and one through "organic cyclic compound binding".

Results on the UniProt (Virtuoso) endpoint

"binding" and "molecular_function" each appear twice, once per route.

parent_go                                    parent_go_label
http://purl.obolibrary.org/obo/GO_1901363    heterocyclic compound binding
http://purl.obolibrary.org/obo/GO_0097159    organic cyclic compound binding
http://purl.obolibrary.org/obo/GO_0005488    binding
http://purl.obolibrary.org/obo/GO_0005488    binding
http://purl.obolibrary.org/obo/GO_0003677    DNA binding
http://purl.obolibrary.org/obo/GO_0003676    nucleic acid binding
http://purl.obolibrary.org/obo/GO_0003674    molecular_function
http://purl.obolibrary.org/obo/GO_0003674    molecular_function

Results on Fuseki

"binding" and "molecular_function" are each reachable by two routes, but each appears only once.

parent_go         parent_go_label
obo:GO_1901363    "heterocyclic compound binding"
obo:GO_0097159    "organic cyclic compound binding"
obo:GO_0005488    "binding"
obo:GO_0003677    "DNA binding"
obo:GO_0003676    "nucleic acid binding"
obo:GO_0003674    "molecular_function"
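
The two result tables can be reproduced with a small Python sketch that contrasts "one row per path" with "one row per reachable node". The subClassOf edges below are an assumption reconstructed from the hierarchy described on this page (DNA binding under nucleic acid binding, which sits under both cyclic compound binding terms); they are not read from go.owl. The functions count_paths and reachable are illustrative only and say nothing about how Virtuoso or Fuseki are implemented internally.

```python
from collections import Counter

# Assumed rdfs:subClassOf edges (child -> direct parents).
SUBCLASS_OF = {
    "GO_0003677": ["GO_0003676"],                # DNA binding -> nucleic acid binding
    "GO_0003676": ["GO_1901363", "GO_0097159"],  # -> heterocyclic / organic cyclic compound binding
    "GO_1901363": ["GO_0005488"],                # -> binding
    "GO_0097159": ["GO_0005488"],                # -> binding
    "GO_0005488": ["GO_0003674"],                # -> molecular_function
    "GO_0003674": [],
}


def count_paths(start):
    """Count, for each term, the number of distinct upward paths from start.

    The zero-length path is included, matching the * operator.
    Only safe on acyclic hierarchies such as this one.
    """
    counts = Counter()
    stack = [start]
    while stack:
        node = stack.pop()
        counts[node] += 1
        stack.extend(SUBCLASS_OF[node])
    return counts


def reachable(start):
    """Return the set of terms reachable from start, each exactly once."""
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(SUBCLASS_OF[node])
    return seen


per_path = count_paths("GO_0003677")
per_node = reachable("GO_0003677")
for term in sorted(per_node, reverse=True):
    print(term, "rows per path:", per_path[term], "rows per node: 1")
```

Under these assumed edges, counting per path gives two rows each for GO_0005488 and GO_0003674 (eight rows in total), as in the Virtuoso table, while counting per node gives six rows, as in the Fuseki table.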