Finding and Classifying Near-Duplicate Pages based on Identical Sentences Detection

Kang, Naun

Access count for this item: 2346

http://hdl.handle.net/2433/71056

Files in this item:

File	Description	Size	Format
M_Kang_Naun.pdf		2.93 MB	Adobe PDF

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	黒橋, 禎夫	-
dc.contributor.author	姜, ナウン	ja
dc.contributor.alternative	Kang, Naun	en
dc.date.accessioned	2009-03-18T05:55:21Z	-
dc.date.available	2009-03-18T05:55:21Z	-
dc.date.created	2009-02-06	-
dc.date.issued	2009-03-23	-
dc.identifier.uri	http://hdl.handle.net/2433/71056	-
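dc.description.abstract	In recent years, the number of Web pages has grown explosively, and search engines give us access to a wide variety of information. However, about 40% of Web pages are said to be similar pages, so search results often contain similar pages. This study performs similar-page detection on a large-scale Web collection of 100 million pages. We define similar pages as two pages that share character strings to some extent; they include identical pages such as mirror pages, citation pages, and plagiarized pages. The method first extracts long, low-frequency sentences from each page, because two pages that share a sentence that is long and has a low frequency across the whole Web can be considered highly related. In addition, the content region of each page is extracted, and only sentences in the content region are used as clues for similar-page detection, because two pages that share a sentence in a non-content region are only weakly related. Page pairs that share a sentence obtained through this processing are regarded as similar pages. Next, similar pages are automatically classified into identical pages, citation pages, plagiarized pages, and so on. Classification uses various kinds of information, such as the overlap ratio (the proportion of a page made up of shared strings), the presence or absence of inlinks/outlinks, and URL similarity. In a similar-page detection experiment, we found mirror pages that cannot be identified by simple URL normalization, citation pages, and spam pages that appear to be pasted together from articles on various sites.	ja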
dc.description.abstract	The recent explosive growth in the number of Web pages has made it possible to obtain a wide variety of information through search engines. However, by some estimates, as many as 40% of the pages on the Web are duplicates of other pages, so search results often contain duplicate pages. This thesis proposes a method for detecting similar pages in a very large collection of one hundred million Japanese Web pages. Similar pages are defined as two pages that share some sentences, and they are classified into mirror pages, citation pages, plagiarized pages, and so on. First, relatively long sentences are extracted from each page, because two pages tend to be related when they share relatively long sentences. A pair of pages that have identical sentences is regarded as similar. Next, similar pages are classified based on several kinds of information, such as the overlap ratio, the number of inlinks/outlinks, and content region extraction. We conducted similar page detection and classification on the large-scale Japanese Web page collection and found mirror pages that cannot be found by simple URL normalization, as well as citation pages and plagiarized pages.	en
dc.format.mimetype	application/pdf	-
dc.language.iso	jpn	-
dc.publisher	京都大学	ja
dc.publisher.alternative	Kyoto University	en
dc.subject.ndc	007	-
dc.title	同一文抽出に基づく類似ページの検出と分類	ja
dc.title.alternative	Finding and Classifying Near-Duplicate Pages based on Identical Sentences Detection	en
dc.type	master thesis	-
dc.type.niitype	Thesis or Dissertation	-
dc.textversion	author	-
dc.description.degreegrantor	京都大学	ja
dc.description.degreeuniversitycode	0048	-
dc.description.degreelevel	Master's	-
dc.description.degreediscipline	Master of Informatics	ja
dc.date.granted	2009-03-23	-
dcterms.accessRights	open access	-
dc.description.degreegrantor-en	Kyoto University	en
dc.description.degreeObjectType	TFtmp	-
jpcoar.contributor.Type	Supervisor	-
jpcoar.contributor.Name	黒橋, 禎夫	ja
Appears in Collections:	914 Master of Informatics
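
The detection step summarized in the abstract (keep long, low-frequency sentences, then treat pages that share one as similar) and the overlap ratio used for classification can be illustrated with a minimal Python sketch. This is not the thesis implementation. The function names, the naive punctuation-based sentence splitter, both thresholds, and the use of "number of pages in the collection containing the sentence" as a stand-in for Web-wide frequency are all assumptions made for illustration, and the input texts are assumed to be already restricted to each page's content region.

```python
import re
from collections import defaultdict
from itertools import combinations

# Naive splitter: a run of non-terminator characters plus an optional
# terminator. Real pages would need a proper sentence segmenter (assumption).
_SENTENCE = re.compile(r"[^。．.!?！？]+[。．.!?！？]?")


def split_sentences(text):
    """Split content-region text into stripped, non-empty sentences."""
    return [s.strip() for s in _SENTENCE.findall(text) if s.strip()]


def find_similar_pairs(pages, min_length=20, max_page_freq=3):
    """Return {(page_a, page_b): set of shared sentences}.

    pages maps a page id to its content-region text. min_length and
    max_page_freq are hypothetical thresholds, not values from the thesis.
    """
    sentence_to_pages = defaultdict(set)
    for page_id, text in pages.items():
        for sentence in split_sentences(text):
            if len(sentence) >= min_length:  # keep only long sentences
                sentence_to_pages[sentence].add(page_id)

    shared = defaultdict(set)
    for sentence, ids in sentence_to_pages.items():
        # Skip sentences found on many pages: they are high-frequency
        # and therefore weak evidence that two pages are related.
        if 2 <= len(ids) <= max_page_freq:
            for a, b in combinations(sorted(ids), 2):
                shared[(a, b)].add(sentence)
    return dict(shared)


def overlap_ratio(page_text, shared_sentences):
    """Fraction of the page's sentence characters that lie in shared sentences."""
    sentences = split_sentences(page_text)
    total = sum(len(s) for s in sentences)
    if total == 0:
        return 0.0
    covered = sum(len(s) for s in sentences if s in shared_sentences)
    return covered / total
```

A classifier along the lines the abstract describes would combine `overlap_ratio` for both pages of a pair with link and URL features; a ratio near 1.0 on both sides would suggest a mirror page, while a low ratio on one side would be more consistent with a citation. Those decision rules are not given in this record, so they are left out of the sketch.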

All items stored in this repository are protected by copyright.