Content Association In Wikipedia

Description

Wikipedia, the free online encyclopedia, contains a wealth of intellectually and monetarily free content (in common terminology, “free as in speech and free as in beer”). The sheer number of users editing the corpus means that the majority of the articles are well-written and largely factual. However, the relationship between related articles, usually inferred by the See Also links at the bottom of each article, are generally incomplete compared to the relationships implied by words linked amidst the text of each article. We propose a PHP framework to spider Wikipedia, collecting both full-text word lists and lists containing only the words from the text of internal links. We propose comparing the relative performance of a system that attempts to find similarity metrics between articles based on the full text of each article and one based on only on the linked words in each article. This implementation uses the TF-IDF algorithm to normalize word frequency and the cosine similarity metric to rank article similarity. Please see http://www.cemetech.net/projects/item.php?id=30 for a full description and documentation, including information on installing and using this program.

Screenshots

Archive Contents

Name	Size
func_wiki.php	8.1 KB
wiki_nlp.php	6.6 KB
func_misc.php	486 bytes

Download file

File Size: 5.3 KB
Short link: http://ceme.tech/DL444

Metadata

Author: KermMartian
Uploaded: 14 years ago

Statistics

Rating: No ratings.

Downloads: 606
Views: 1640

Reviews

Nobody has reviewed this file yet.

Versions

Content Association In Wikipedia (published 14 years ago; 2010-05-05 18:17 UTC)