DeepPeep: indexing online databases

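DeepPeep is an ambitious project to index every online database. The weak point of Google's search engine is indexing the Deep Web, meaning content such as online databases. Google currently runs automated systems that submit queries to some databases, then extract and index the content that comes back, but coverage is still far from complete. Beyond the shallow Web that Google indexes lies an even vaster Deep Web. Hidden data such as financial information, shopping catalogs, flight schedules, and medical research data remains largely invisible to search engines. If the Deep Web were indexed, search engines could easily answer questions like "What is the cheapest ticket from New York to London leaving next Thursday?"

DeepPeep, led by Prof. Juliana Freire at the University of Utah, claims to be able to extract more than 90% of the content of any database, and it was reportedly approached recently by a well-known search engine. Making the Deep Web searchable would also change how businesses use search. For example, a health site could cross-reference and present the latest medical data from different pharmaceutical companies, and a local news site could merge in publicly available government databases. Mike Bergman, who coined the term "Deep Web," likewise sees its significance less in satisfying the whims of web surfers than in transforming business.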

Beyond those trillion pages lies an even vaster Web of hidden data: financial information, shopping catalogs, flight schedules, medical research and all kinds of other material stored in databases that remain largely invisible to search engines. The challenges that the major search engines face in penetrating this so-called Deep Web go a long way toward explaining why they still can't provide satisfying answers to questions like "What's the best fare from New York to London next Thursday?" The answers are readily available - if only the search engines knew how to find them. [...] In a similar vein, Prof. Juliana Freire at the University of Utah is working on an ambitious project called DeepPeep (www.deeppeep.org) that eventually aims to crawl and index every database on the public Web. Extracting the contents of so many far-flung data sets requires a sophisticated kind of computational guessing game.

http://www.nytimes.com/2009/02/23/technology/internet/23search.html?_r=1&ref=technology

Halevy, who heads the "Deep Web" search initiative at Google, described the "Shallow Web" as containing about 5 million web pages while the "Deep Web" is estimated to be 500 times the size. This hidden web is currently being indexed in part by Google's automated systems that submit queries to various databases, retrieving the content found for indexing. In addition to that aspect of the Deep Web - dubbed "vertical searching"

http://www.readwriteweb.com/archives/google_were_not_doing_a_good_job_with_structured_data.php