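String matching

In the data processed here, the numeric parts of a string often carry no meaning, so it proved effective to weight similarity on the alphabetic part. For this reason we use the DoubleMetaphone algorithm included in the Apache Commons Codec library.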

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.commons.codec.language.DoubleMetaphone;

public class getDoubleMetaphone {

     // Reads tab-separated lines from standard input and appends the
     // Double Metaphone code of the second column to each line.
     public static void main(String[] args) {
         DoubleMetaphone dmp = new DoubleMetaphone();
         try (BufferedReader in = new BufferedReader(new InputStreamReader(System.in))) {
             String str;
             while ((str = in.readLine()) != null) {
                 String[] parts = str.split("\t");
                 if (parts.length < 2) {
                     continue; // no second column to encode
                 }
                 // false selects the primary encoding rather than the alternate one
                 System.out.println(str + '\t' + dmp.doubleMetaphone(parts[1], false));
             }
         } catch (IOException e) {
             System.err.println("Failed to read standard input: " + e.getMessage());
             System.exit(1);
         }
     }
}
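
The program above appends the code as the last tab-separated column. As a rough sketch of how that output might be used afterwards, the following groups lines that share the same code, so that strings with similar alphabetic parts land in the same bucket. This step is an assumption and not part of the original workflow; the names `CodeGrouper` and `groupByCode` are hypothetical, and the codes in the sample are placeholders rather than real encoder output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class CodeGrouper {

    // Groups tab-separated lines by their last column, which is assumed
    // to hold the phonetic code appended by the program above.
    static Map<String, List<String>> groupByCode(List<String> lines) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String line : lines) {
            int tab = line.lastIndexOf('\t');
            if (tab < 0) {
                continue; // no code column, skip the line
            }
            String code = line.substring(tab + 1);
            String rest = line.substring(0, tab);
            groups.computeIfAbsent(code, k -> new ArrayList<>()).add(rest);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Hypothetical input: the codes are placeholders, not real output.
        List<String> lines = Arrays.asList(
            "1\tsample A\tCODE1",
            "2\tsample B\tCODE2",
            "3\tsample A2\tCODE1");
        groupByCode(lines).forEach((code, members) ->
            System.out.println(code + " -> " + members));
    }
}
```

Grouping on the exact code is the simplest possible use; whether the primary code alone is discriminating enough depends on the data.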

Ontology matching

Terms from the relevant ontologies are matched against a given string.

Ontologies

The ontologies to be matched are as follows:

EnvO (Environment Ontology) - http://environmentontology.org/
FMA (Foundational Model of Anatomy) - http://sig.biostr.washington.edu/projects/fm/AboutFM.html
Plant Ontology - http://www.plantontology.org/
OGR (Ontology of Geographical Region) - http://bioportal.bioontology.org/ontologies/1087
OBIS (Ocean Biogeographic Information System) - http://www.iobis.org/
Gazetteer - http://bioportal.bioontology.org/ontologies/1397

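Generating ID-term mapping files

OBIS is not straightforward to download, so we decided not to fetch it. The remaining five ontologies come in the file formats listed below, and each format is handled separately.

Ontology	File format
EnvO	obo
FMA	MySQL dump
Plant Ontology	obo
OGR	owl
Gazetteer	obo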

OBO format

EnvO, Plant Ontology, and Gazetteer are available in OBO format, so a Perl program extracts the pairs of term and ID.

use warnings;
use strict;

# Print "ID<TAB>name" for every [Term] stanza of an OBO file.
my $skipflag = 0;   # true while inside a non-Term stanza
my $termid = "";
while(<>){
   chomp;
   if(/^$/){
       # A blank line ends the current stanza.
       $skipflag = 0;
       next;
   }
   next if $skipflag;
   if(index($_, "[") == 0){
       # Stanza header such as [Term] or [Typedef].
       $termid = "";
       my ($cat) = $_ =~ m/^\[(\w+)\]/;
       if(!defined $cat || $cat ne "Term"){
           $skipflag = 1;
       }
       next;
   }
   if(index($_, "id: ") == 0){
       substr($_, 0, 4, "");
       $termid = $_;
   }
   next unless index($_, "name: ") == 0;
   substr($_, 0, 6, "");
   print join("\t", ($termid, $_)), "\n";
}
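
For readers less familiar with Perl, the same stanza-scanning logic can be sketched in Java. This is an illustrative re-expression under the assumption that stanzas are separated by blank lines, not the script that was actually used; `OboTerms` and `extract` are hypothetical names, and the sample stanzas are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class OboTerms {

    // Returns "id<TAB>name" for every [Term] stanza. Other stanzas
    // (for example [Typedef]) are skipped until the next blank line.
    static List<String> extract(List<String> lines) {
        List<String> out = new ArrayList<>();
        boolean skip = false;
        String id = "";
        for (String line : lines) {
            if (line.isEmpty()) {        // a blank line ends the stanza
                skip = false;
                continue;
            }
            if (skip) {
                continue;
            }
            if (line.startsWith("[")) {  // stanza header
                id = "";
                skip = !line.startsWith("[Term]");
                continue;
            }
            if (line.startsWith("id: ")) {
                id = line.substring(4);
            } else if (line.startsWith("name: ")) {
                out.add(id + "\t" + line.substring(6));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up stanzas for illustration only.
        List<String> sample = Arrays.asList(
            "[Term]", "id: EX:0000001", "name: example term", "",
            "[Typedef]", "id: part_of", "name: part of", "");
        extract(sample).forEach(System.out::println);
    }
}
```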

MySQL dump format

FMA is available as a MySQL dump, so the dump is loaded as-is.

+-------------+----------------+------+-----+---------+-------+
| Field       | Type           | Null | Key | Default | Extra |
+-------------+----------------+------+-----+---------+-------+
| frame       | varbinary(500) | NO   | MUL | NULL    |       | 
| frame_type  | smallint(6)    | NO   |     | NULL    |       | 
| slot        | varbinary(500) | NO   | MUL | NULL    |       | 
| facet       | varbinary(500) | NO   |     | NULL    |       | 
| is_template | bit(1)         | NO   |     | NULL    |       | 
| value_index | int(11)        | NO   |     | NULL    |       | 
| value_type  | smallint(6)    | NO   |     | NULL    |       | 
| short_value | varchar(500)   | YES  | MUL | NULL    |       | 
| long_value  | mediumtext     | YES  |     | NULL    |       | 
+-------------+----------------+------+-----+---------+-------+

'SELECT frame,slot,short_value FROM fma WHERE slot="name" OR slot="FMAID"'

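The table is keyed on `frame`, so the query above is issued to fetch the rows, and a Perl script then builds the term-ID pairs. Some terms (shorter than three characters, digits only, digits followed by a single letter, a single uppercase letter followed by digits) are filtered out before output.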

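use warnings;
use strict;

# Input: tab-separated rows of frame, slot, short_value from the query above.
my (%fmaids, %terms);

while(<>){
   chomp;
   my ($frame, $slot, $short) = split /\t/;
   next unless defined $slot && defined $short;
   if($slot eq 'FMAID'){
       $fmaids{$frame} = $short;
   }elsif($slot eq 'name'){
       $terms{$frame} = $short;
   }
}

# Join the FMAID and name rows on frame and print "FMAID<TAB>term".
while(my ($frame, $fmid) = each %fmaids){
   next unless $terms{$frame};
   my $term = $terms{$frame};
   # Drop terms shorter than 3 characters, ending in digits, ending in
   # digits plus one letter, or starting with uppercase letters and digits
   # followed by a space.
   next if length($term) < 3 || $term =~ /\d+$/ || $term =~ /\d+[A-Za-z]$/ || $term =~ /^[A-Z]+\d+ /;
   print join("\t", ($fmid, $term)), "\n";
}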
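
The four filter rules named in the text can be restated as fully anchored patterns, as in the sketch below. It follows the prose description literally; the Perl script above uses looser patterns that are not anchored at both ends, so the two are not guaranteed to drop exactly the same terms. `TermFilter` and `isFiltered` are hypothetical names, and the sample terms are only illustrative.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class TermFilter {
    // Each pattern is applied with matches(), so it must cover the whole term.
    private static final Pattern DIGITS_ONLY = Pattern.compile("\\d+");
    private static final Pattern DIGITS_THEN_LETTER = Pattern.compile("\\d+[A-Za-z]");
    private static final Pattern UPPER_THEN_DIGITS = Pattern.compile("[A-Z]\\d+");

    // Returns true when the term should be dropped from the output.
    static boolean isFiltered(String term) {
        return term.length() < 3
            || DIGITS_ONLY.matcher(term).matches()
            || DIGITS_THEN_LETTER.matcher(term).matches()
            || UPPER_THEN_DIGITS.matcher(term).matches();
    }

    public static void main(String[] args) {
        for (String term : Arrays.asList("ab", "12345", "123a", "A123", "Heart")) {
            System.out.println(term + "\t" + (isFiltered(term) ? "dropped" : "kept"));
        }
    }
}
```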

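Other

OGR is available only in OWL format, but it has no rdfs:label attribute and uses the string itself as the ID; its rdfs:comment attribute value merely records the ID of the corresponding MeSH geographic entry (the Z01 tree). We therefore extract the Z01 portion from MeSH 2011 (mtrees2011.bin), which DBCLS had already obtained separately.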

grep ';Z0' mtrees2011.bin | perl -pe 's/;/\t/'
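
The one-liner above keeps the lines whose tree number starts with Z0 and turns the first semicolon into a tab. A minimal Java sketch of the same two steps follows, assuming each line of mtrees2011.bin has the form `name;tree number`; `MeshGeo` and `extractZ0` are hypothetical names, and the three sample lines are only there to exercise the code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class MeshGeo {

    // Keeps lines containing ";Z0" and replaces the first ';' with a tab,
    // mirroring the grep and perl steps of the shell pipeline.
    static List<String> extractZ0(List<String> lines) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            if (line.contains(";Z0")) {
                out.add(line.replaceFirst(";", "\t"));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Sample lines in the assumed "name;tree number" layout.
        List<String> sample = Arrays.asList(
            "Body Regions;A01",
            "Geographic Locations;Z01",
            "Africa;Z01.058");
        extractZ0(sample).forEach(System.out::println);
    }
}
```

Note that the resulting columns are in term, ID order, which is the reverse of the ID, term order printed by the Perl scripts earlier on this page.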