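When building a search engine, one of the biggest challenges is detecting duplicate HTML documents. An MD5 hash on its own doesn't work, because even a small change in the text of the HTML produces a completely different hash. I searched for a way to find near-duplicates in MySQL without much luck. However, with a little persistence I was able to produce a rather accurate way to find duplicate documents: store MySQL's SOUNDEX() of the tag-stripped text instead of the text itself, so pages that read almost the same end up with the same value in the contents column.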
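To see why a plain hash is not enough, here is a minimal sketch. The two sample strings are made up for illustration; they differ by a single space, yet an MD5 comparison treats them as unrelated pages.

```php
<?php
// Two made-up pages that differ only by one space before the closing tag.
$page_a = '<p>Hello world</p>';
$page_b = '<p>Hello world </p>';

$hash_a = md5($page_a);
$hash_b = md5($page_b);

// The hashes do not match, so an MD5 check alone misses this duplicate.
var_dump($hash_a === $hash_b); // bool(false)

// After stripping tags and trimming whitespace the visible text is identical.
var_dump(trim(strip_tags($page_a)) === trim(strip_tags($page_b))); // bool(true)
```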
Here is the table structure used.
CREATE TABLE IF NOT EXISTS `documents` (
    `document_id` int(10) unsigned NOT NULL auto_increment,
    `contents` text NOT NULL,
    `url` varchar(255) NOT NULL,
    PRIMARY KEY (`document_id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=1;
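// Crawl each link and store a SOUNDEX key of its visible text.
// mysqli_db_table and SQL() are not built into PHP; they come from a custom database wrapper.
$md5s = array(); // hashes of pages already seen during this run
$documents = new mysqli_db_table('documents');
foreach ($links as $link)
{
    $html = file_get_contents($link);
    if ($html === false) continue; // fetch failed, skip this link
    $md5 = md5($html);
    // Skip byte-for-byte duplicates cheaply before touching the database.
    if (isset($md5s[$md5])) continue;
    $documents->insert(array(
        'url' => $link,
        // SOUNDEX() of the tag-stripped text: near-identical pages get the same key.
        // filter() is assumed to escape and quote the value for use in SQL.
        'contents' => SQL('SOUNDEX(' . $documents->filter(strip_tags($html)) . ')')
    ));
    $md5s[$md5] = $md5;
}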
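Why does SOUNDEX() make near-identical pages collide? MySQL's SOUNDEX() returns an arbitrarily long string (not just four characters), ignores non-alphabetic characters, drops vowels, and collapses runs of letters that share a code. The sketch below is a simplified approximation of that behaviour in PHP, not a byte-for-byte copy of MySQL's implementation; the function name long_soundex_sketch is hypothetical and the sample strings are made up. It needs PHP 7 or later.

```php
<?php
// Simplified, unbounded Soundex-style key. An approximation for illustration,
// not MySQL's exact algorithm (for example, it does not pad short keys).
function long_soundex_sketch(string $text): string
{
    $codes = array(
        'B' => '1', 'F' => '1', 'P' => '1', 'V' => '1',
        'C' => '2', 'G' => '2', 'J' => '2', 'K' => '2',
        'Q' => '2', 'S' => '2', 'X' => '2', 'Z' => '2',
        'D' => '3', 'T' => '3',
        'L' => '4',
        'M' => '5', 'N' => '5',
        'R' => '6',
    );

    // Keep letters only: digits, punctuation and whitespace are ignored.
    $letters = (string) preg_replace('/[^A-Z]/', '', strtoupper($text));
    if ($letters === '') {
        return '';
    }

    // The key starts with the first letter itself.
    $key  = $letters[0];
    $last = $codes[$letters[0]] ?? '';

    for ($i = 1; $i < strlen($letters); $i++) {
        $code = $codes[$letters[$i]] ?? '';
        if ($code === '') {
            continue; // vowels, H, W and Y are dropped first
        }
        if ($code !== $last) {
            $key .= $code; // then repeated codes are collapsed
        }
        $last = $code;
    }
    return $key;
}

// Small edits (case, punctuation, a changed year) leave the key untouched.
echo long_soundex_sketch('Hello, World! 2009'), "\n"; // H4643
echo long_soundex_sketch('hello world 2010'), "\n";   // H4643

// Genuinely different text produces a different key.
echo long_soundex_sketch('Goodbye world'), "\n";      // G31643
```

The trade-off is that the key is lossy on purpose: pages that differ only in numbers, punctuation or vowels are treated as the same document, which is usually what you want for boilerplate changes but can also merge pages that really are different.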
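// Find documents whose SOUNDEX key matches an earlier document, then delete them.
// Only later rows are matched (>), so the lowest document_id in each group is kept.
// With != every copy, including the original, would end up deleted.
$db = mysqli_db::init();
$docs = $db->fetch_all('SELECT d1.url, GROUP_CONCAT( d2.document_id ) AS d2_ids, d1.document_id AS d1_id
    FROM documents AS d1
    LEFT JOIN documents AS d2 ON d2.contents = d1.contents
        AND d2.document_id > d1.document_id
    GROUP BY d1.document_id');
foreach ($docs as $doc)
{
    // d2_ids is NULL when no later document shares this key.
    if ($doc['d2_ids'])
    {
        $db->query('DELETE FROM documents WHERE document_id IN (' . $doc['d2_ids'] . ')');
    }
}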
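One thing to watch with GROUP_CONCAT is that MySQL truncates its output at group_concat_max_len (1024 bytes by default), so a very large group of duplicates can yield a cut-off id list. As an alternative, the same "keep one copy, delete the rest" decision can be made in PHP. This is only a sketch: duplicate_ids is a hypothetical helper, and the rows are made-up sample data shaped like a SELECT document_id, contents FROM documents result. It needs PHP 7 or later.

```php
<?php
// Hypothetical helper: given rows of (document_id, contents key), return the
// ids that duplicate an earlier row. The lowest id per key is always kept.
function duplicate_ids(array $rows): array
{
    // Sort by id so "first seen" means "lowest document_id".
    usort($rows, function ($a, $b) {
        return $a['document_id'] <=> $b['document_id'];
    });

    $seen  = array(); // contents key => true once a copy has been kept
    $dupes = array(); // ids that are safe to delete
    foreach ($rows as $row) {
        $key = $row['contents'];
        if (isset($seen[$key])) {
            $dupes[] = $row['document_id'];
        } else {
            $seen[$key] = true;
        }
    }
    return $dupes;
}

// Made-up sample rows: ids 1, 3 and 4 share a key, id 2 is unique.
$rows = array(
    array('document_id' => 3, 'contents' => 'H4643'),
    array('document_id' => 1, 'contents' => 'H4643'),
    array('document_id' => 2, 'contents' => 'G31643'),
    array('document_id' => 4, 'contents' => 'H4643'),
);

print_r(duplicate_ids($rows)); // ids 3 and 4; ids 1 and 2 survive
```

The returned ids can then go into a single DELETE ... WHERE document_id IN (...) statement, built with implode(',', array_map('intval', $ids)) and skipped entirely when the array is empty.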