12 Jun
Effective & Fast HTML Document Duplication Detection
By Joseph Montanez

When building a search engine, one of the biggest challenges is detecting duplicate HTML documents. An MD5 hash of the page just doesn't work, because two copies of the same page often differ by small changes in the HTML, and any change at all produces a different hash. I searched for a way to do this in MySQL without much luck. With a little persistence, however, I was able to produce a rather accurate way to find duplicate documents.
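
To make the MD5 problem concrete, here is a minimal sketch. The two page strings are made up for illustration; they differ only in one attribute, yet the hashes no longer match even though the visible text is the same.

```php
<?php
// Two hypothetical copies of one page that differ only in markup.
$pageA = '<p class="intro">Hello world</p>';
$pageB = '<p class="lead">Hello world</p>';

// MD5 changes as soon as a single byte differs.
$sameHash = md5($pageA) === md5($pageB);

// After strip_tags() the visible text of both pages is identical.
$sameText = strip_tags($pageA) === strip_tags($pageB);

var_dump($sameHash); // bool(false)
var_dump($sameText); // bool(true)
```

Stripping tags alone only covers markup differences, which is why a fuzzier fingerprint of the text is needed.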

Here is the table structure used.

CREATE TABLE IF NOT EXISTS `documents` (
  `document_id` int(10) unsigned NOT NULL auto_increment,
  `contents` text NOT NULL,
  `url` varchar(255) NOT NULL,
  PRIMARY KEY  (`document_id`)
) ENGINE=MyISAM  DEFAULT CHARSET=latin1 AUTO_INCREMENT=1;

First, we index all of the HTML content with MySQL's built-in SOUNDEX() function. The MD5 check in the loop only skips pages that are byte-for-byte identical.
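$md5s = array(); // MD5 hashes of pages already stored, used to skip exact copies
$documents = new mysqli_db_table('documents');
foreach($links as $link)
{
	$html = file_get_contents($link);
	if($html === false) continue; // skip pages that could not be fetched
	$md5 = md5($html);
	if(isset($md5s[$md5])) continue; // byte-identical page already indexed
	$documents->insert(array(
	    'url' => $link,
	    // store MySQL's SOUNDEX() of the tag-stripped text instead of the raw HTML
	    'contents' => SQL('SOUNDEX(' . $documents->filter(strip_tags($html)) . ')')
	));
	$md5s[$md5] = $md5;
}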
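
To get a feel for what a SOUNDEX-style fingerprint tolerates, here is a rough sketch in PHP. The function name longSoundex() is my own, and this is only an approximation of a soundex code that is not cut off at four characters. I have not checked it against MySQL's SOUNDEX() for every input, so treat it as an illustration rather than a drop-in replacement. PHP's own soundex() always returns four characters, so it is not used here.

```php
<?php
// Hypothetical helper: an approximate, untruncated soundex code.
// Not guaranteed to match MySQL's SOUNDEX() output for every input.
function longSoundex(string $text): string
{
    $codes = array(
        'B' => '1', 'F' => '1', 'P' => '1', 'V' => '1',
        'C' => '2', 'G' => '2', 'J' => '2', 'K' => '2',
        'Q' => '2', 'S' => '2', 'X' => '2', 'Z' => '2',
        'D' => '3', 'T' => '3',
        'L' => '4',
        'M' => '5', 'N' => '5',
        'R' => '6',
    );

    // Ignore case and everything that is not a letter A-Z.
    $letters = preg_replace('/[^A-Z]/', '', strtoupper($text));
    if ($letters === '') {
        return '';
    }

    // The first letter is kept as-is; later letters become digits.
    $result = $letters[0];
    $last = isset($codes[$letters[0]]) ? $codes[$letters[0]] : '';

    for ($i = 1; $i < strlen($letters); $i++) {
        $letter = $letters[$i];
        if (!isset($codes[$letter])) {
            continue; // vowels and H, W, Y are dropped
        }
        if ($codes[$letter] !== $last) {
            $result .= $codes[$letter]; // skip repeats of the same digit
        }
        $last = $codes[$letter];
    }

    // Pad short codes with zeros to the classic four characters.
    return str_pad($result, 4, '0');
}

// Punctuation, spacing, case and vowel changes do not affect the code.
var_dump(longSoundex('Hello, world!') === longSoundex('hallo   wurld')); // bool(true)
```

Changing or adding a consonant still changes the code, so this kind of fingerprint catches near-identical pages, not loosely similar ones.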

Now that the HTML pages are indexed with SOUNDEX, we can run a query to find all documents whose SOUNDEX values are the same and remove the duplicates.
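$db = mysqli_db::init();

// For each document, collect the IDs of later documents with the same SOUNDEX value.
// Matching only higher IDs keeps the earliest copy instead of deleting every copy.
// Note: GROUP_CONCAT output is capped by group_concat_max_len (1024 bytes by default).
$docs = $db->fetch_all('SELECT d1.url, GROUP_CONCAT( d2.document_id ) AS d2_ids, d1.document_id AS d1_id
FROM documents AS d1
LEFT JOIN documents AS d2 ON d2.contents = d1.contents
AND d2.document_id > d1.document_id
GROUP BY d1.document_id');

foreach($docs as $doc)
{
    // d2_ids is NULL when the document has no later duplicate
    if($doc['d2_ids'])
    {
        $db->query('DELETE FROM documents WHERE document_id IN (' . $doc['d2_ids'] . ')');
    }
}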
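
The clean-up logic can also be sketched without a database, which makes it easier to test. This is only an illustration: findDuplicateIds() and the sample rows are made up, and keeping the copy with the lowest document_id is one reasonable policy, not the only one.

```php
<?php
// Hypothetical helper: given rows with 'document_id' and 'contents'
// (the stored SOUNDEX value), return the IDs that should be deleted.
// The copy with the lowest document_id in each group is kept.
function findDuplicateIds(array $rows): array
{
    // Process rows in ascending ID order so the earliest copy wins.
    usort($rows, function ($a, $b) {
        return $a['document_id'] <=> $b['document_id'];
    });

    $kept = array();       // contents => document_id of the copy we keep
    $duplicates = array(); // IDs of every later copy

    foreach ($rows as $row) {
        if (isset($kept[$row['contents']])) {
            $duplicates[] = $row['document_id'];
        } else {
            $kept[$row['contents']] = $row['document_id'];
        }
    }

    return $duplicates;
}

// Made-up sample data: documents 1, 3 and 4 share a fingerprint.
$rows = array(
    array('document_id' => 3, 'contents' => 'H4643'),
    array('document_id' => 1, 'contents' => 'H4643'),
    array('document_id' => 2, 'contents' => 'Q36324'),
    array('document_id' => 4, 'contents' => 'H4643'),
);

print_r(findDuplicateIds($rows)); // IDs 3 and 4; document 1 is kept
```

The returned IDs could then be passed to a single DELETE ... WHERE document_id IN (...) statement.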
