BH13.13/DataMapping
From TogoWiki
Data mapping automation
- Mori (Tokyo Tech), Yamamoto (DBCLS), Okamoto (DBCLS)
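Background and scope
- Automate the assignment of ontology terms and controlled vocabulary to strain metadata
- Fuzzy search plus vocabulary recommendation
- Needed right away to reduce the cost of manual mapping
- For now, build a processing pipeline by combining existing NLP programs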
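Pipeline overview
- Define each processing step clearly (steps can be used separately and rolled back)
- Support diverse datasets and dictionaries
- Exact matches are replaced automatically
- Fuzzy matches recommend vocabulary terms, tolerating false positives (with scoring?)
- Algorithms: text scan, n-grams and cosine similarity
- Dictionaries: ontologies (dictionaries), gold-standard mapping set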
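The exact-match and fuzzy-match steps listed above can be sketched roughly as follows. This is only an illustration, not the code used in the pilot: the function names, the threshold of 0.5, the `top_k` cutoff and the three-label dictionary are all assumptions made for the example, and the similarity is a plain cosine over character-bigram counts.

```python
from collections import Counter
from math import sqrt


def bigrams(text):
    """Return a Counter of character bigrams of the lowercased text."""
    s = text.lower()
    return Counter(s[i:i + 2] for i in range(len(s) - 1))


def cosine(a, b):
    """Cosine similarity between two bigram Counters."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def map_term(term, labels, threshold=0.5, top_k=3):
    """Accept exact matches automatically; otherwise recommend ranked candidates.

    threshold and top_k are assumed values, not taken from the pilot.
    """
    lowered = {label.lower(): label for label in labels}
    if term.lower() in lowered:
        return ("exact", [(lowered[term.lower()], 1.0)])
    query = bigrams(term)
    scored = [(label, cosine(query, bigrams(label))) for label in labels]
    hits = sorted((s for s in scored if s[1] >= threshold), key=lambda s: -s[1])
    return ("recommend", hits[:top_k]) if hits else ("no_hit", [])


# Hypothetical mini-dictionary; the real labels would come from the ontologies.
labels = ["dried persimmon", "dairy product", "wood"]
print(map_term("wood", labels))
print(map_term("Dried Japanese persimmon", labels))
print(map_term("Creosoted wooden pole", labels))
```

With these assumed settings the second call recommends "dried persimmon", while the third returns no hit because only three bigrams of "wood" overlap with the much longer query. Returning scored candidates instead of a single answer leaves the final decision to a curator, which is the point of tolerating false positives.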
Future work
- Add GAZ and NCBI Taxonomy to the dictionaries
- Adjust the parsing granularity used during text scan
- Apply the gold-standard mapping set
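One way to make the text-scan granularity adjustable is to match runs of up to `max_words` tokens against the dictionary, trying longer runs first. The sketch below assumes that text scan means looking up tokens of the input in a label-to-ID table; the function name, the `max_words` parameter and the `example:deciduous_tree` ID are hypothetical.

```python
import re


def text_scan(text, dictionary, max_words=3):
    """Return IDs of dictionary labels found as runs of up to max_words tokens.

    Longer runs are tried first, and tokens covered by a match are skipped.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    hits, i = [], 0
    while i < len(tokens):
        for n in range(min(max_words, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in dictionary:
                hits.append(dictionary[phrase])
                i += n
                break
        else:
            # No label starts at this token; move on to the next one.
            i += 1
    return hits


# Hypothetical label-to-ID table; real entries would come from the ontology files.
dictionary = {
    "leaf": "meo:leaf",
    "tree": "fma:Tree",
    "deciduous tree": "example:deciduous_tree",
}
text = "Decayed leaf of deciduous tree, Saitama Pref., Japan"
print(text_scan(text, dictionary, max_words=1))  # ['meo:leaf', 'fma:Tree']
print(text_scan(text, dictionary, max_words=2))  # ['meo:leaf', 'example:deciduous_tree']
```

With `max_words=1` the scan works at single-word granularity; raising it lets a more specific multi-word label win over its head word.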
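Evaluation of the pilot work
- Data: JCM habitat data, with manual mapping by Mori
- Method: text scan and bigram + cosine similarity, matched against each ontology (dictionary)
- Result file: meoetc_match_result20140129.txt
- Text scan is effective
- Conceptual mappings, where string similarity is of no help, cannot be handled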
TR Creosoted wooden pole manual meo wood
TR Creosoted wooden pole csso no_hit
TR Creosoted wooden pole pdo no_hit
TR Creosoted wooden pole fma no_hit
TR Creosoted wooden pole meo no_hit
- Examples where bigram+cos is effective
TR Buccal cavity manual fma Oral cavity
TR Buccal cavity csso no_hit
TR Buccal cavity pdo no_hit
TR Buccal cavity fma cs Oral cavity [-1:cavity] @@ Cell cavity [-1:cavity] @@ Nasal cavity [-1:cavity] @@ Antral cavity [-1:cavity] @@ Orbital cavity [-1:cavity]
TR Buccal cavity meo cs Oral cavity [-1:cavity] @@ Nasal cavity [-1:cavity]
TR Dried Japanese persimmon manual meo dried persimmon
TR Dried Japanese persimmon csso no_hit
TR Dried Japanese persimmon pdo no_hit
TR Dried Japanese persimmon fma no_hit
TR Dried Japanese persimmon meo cs dried persimmon [-1:persimmon]
TR Fermented food manual meo fermented food product
TR Fermented food all ts meo:food
TR Fermented food csso no_hit
TR Fermented food pdo no_hit
TR Fermented food fma no_hit
TR Fermented food meo cs fermented food product [-1:food] @@ non-fermented food product [-1:food] @@ fermented meat [15:fermented] @@ fermented must [15:fermented] @@ fermented juice [15:fermented]
TR Fermenting molasses manual meo fermented molasses
TR Fermenting molasses all ts meo:molasses
TR Fermenting molasses csso no_hit
TR Fermenting molasses pdo no_hit
TR Fermenting molasses fma no_hit
TR Fermenting molasses meo cs fermented molasses [-1:molasses] @@ fermented cane molasses [-1:molasses]
TR Fingernail manual fma Nail of finger
TR Fingernail csso no_hit
TR Fingernail pdo no_hit
TR Fingernail fma cs Nail of finger [10625:nail] @@ Finger [12838:finger]
TR Fingernail meo cs Nail of finger [10002:nail] @@ Finger [10003:finger]
- Examples where bigram+cos yields false positives
TR Dairy products manual meo dairy product
TR Dairy products csso no_hit
TR Dairy products pdo no_hit
TR Dairy products fma no_hit
TR Dairy products meo cs dairy product [-1:dairy] @@ honey product [10001:honey] @@ poultry product [10001:poultry]
TR Dust manual meo waste
TR Dust csso no_hit
TR Dust pdo no_hit
TR Dust fma no_hit
TR Dust meo cs sawdust [10001:sawdust] # string similarity inside a word mapped the term to a different meaning
TR "Ear discharge, Japan" manual meo discharge
TR "Ear discharge, Japan" manual fma Ear
TR "Ear discharge, Japan" all ts meo:Ear @ meo:discharge
TR "Ear discharge, Japan" csso cs Discharge [-1:discharge]
TR "Ear discharge, Japan" pdo no_hit
TR "Ear discharge, Japan" fma no_hit # short strings are not picked up
TR "Ear discharge, Japan" meo cs discharge [-1:discharge]
TR Fermenter manual meo man-made structure
TR Fermenter csso no_hit
TR Fermenter pdo no_hit
TR Fermenter fma no_hit
TR Fermenter meo cs fermented shrimp [10001:shrimp] @@ fermented starch [10001:starch] @@ fermented meat [10003:meat] @@ fermented must [10004:must] @@ fermented juice [10005:juice] # character similarity is high but the meaning differs
- Cases where the ts and bi+cos results conflict
TR "Decayed leaf of deciduous tree, Saitama Pref., Japan" manual meo leaf
TR "Decayed leaf of deciduous tree, Saitama Pref., Japan" all ts meo:leaf @ fma:Tree # which granularity should be assigned?
TR "Decayed leaf of deciduous tree, Saitama Pref., Japan" csso no_hit
TR "Decayed leaf of deciduous tree, Saitama Pref., Japan" pdo no_hit
TR "Decayed leaf of deciduous tree, Saitama Pref., Japan" fma no_hit
TR "Decayed leaf of deciduous tree, Saitama Pref., Japan" meo no_hit
TR "Decayed leaf of weed, Japan" manual meo leaf
TR "Decayed leaf of weed, Japan" all ts meo:leaf
TR "Decayed leaf of weed, Japan" csso no_hit
TR "Decayed leaf of weed, Japan" pdo no_hit
TR "Decayed leaf of weed, Japan" fma no_hit
TR "Decayed leaf of weed, Japan" meo no_hit
TR "Decayed leaves, Saitama Pref., Japan" manual meo leaf # singular/plural spelling mismatch
TR "Decayed leaves, Saitama Pref., Japan" csso no_hit
TR "Decayed leaves, Saitama Pref., Japan" pdo no_hit
TR "Decayed leaves, Saitama Pref., Japan" fma no_hit
TR "Decayed leaves, Saitama Pref., Japan" meo no_hit
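The singular/plural mismatch seen in the last record might be reduced by normalizing tokens before matching. The sketch below is one possible approach and was not part of the pilot: the rule set, the `IRREGULAR` and `UNCHANGED` tables and both function names are hypothetical, and a real pipeline would more likely rely on an established stemmer or lemmatizer.

```python
import re

# Hand-maintained exception tables (hypothetical, for illustration only).
IRREGULAR = {"leaves": "leaf"}
UNCHANGED = {"molasses"}  # ends in "s" but is not a plural


def singularize(word):
    """Reduce an English token to a rough singular form with a few suffix rules."""
    w = word.lower()
    if w in IRREGULAR:
        return IRREGULAR[w]
    if w in UNCHANGED or w.endswith("ss") or len(w) <= 3:
        return w
    if w.endswith("ies"):
        return w[:-3] + "y"
    if w.endswith("s"):
        return w[:-1]
    return w


def normalize(text):
    """Lowercase, tokenize and singularize every token of a habitat description."""
    return " ".join(singularize(t) for t in re.findall(r"[A-Za-z0-9]+", text))


print(normalize("Decayed leaves, Saitama Pref., Japan"))  # decayed leaf saitama pref japan
print(normalize("Dairy products"))                        # dairy product
print(normalize("Fermenting molasses"))                   # fermenting molasses
```

After normalization, "leaves" becomes "leaf" and can be found by a single-word text scan, and "Dairy products" becomes an exact match for the label "dairy product". The exception tables show the cost of this approach: words such as "molasses" need to be listed by hand.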