Web crawling robot

Martin
30. 3. 2009 21:43:18
Hi everyone,
I have a problem: I want to build a robot that indexes web pages. It starts with one page, collects the links on it, and gradually indexes further pages.
That part is basically not a problem. What I struggle with is processing the data: how do I tell what is in the BODY, what is the essential content, and what is the clutter around it (header, navigation, and so on)?
My goal is to build something like an encyclopedia: identify the text and file it under the right topic. Can you suggest how to do that?
One more question: I want the robot to find dates. Do you know how to identify them in text?
Thanks for any advice and ideas.
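The crawl order described above (start with one page, collect its links, then move on to the pages they point to) is a breadth-first traversal. Here is a minimal sketch of it, assuming a hypothetical `crawl()` function and a `$fetchLinks` callback that returns the links found on a page. To keep it runnable without network access, the "web" here is a small in-memory array, not real pages.

```php
<?php
// Breadth-first crawl. crawl() and $fetchLinks are hypothetical names, not from the thread.
function crawl(string $startUrl, callable $fetchLinks, int $maxPages = 100): array
{
    $queue = [$startUrl];         // frontier: pages still to visit
    $seen = [$startUrl => true];  // every URL ever queued, so loops are not followed twice
    $visited = [];                // pages in the order they were indexed

    while ($queue && count($visited) < $maxPages) {
        $url = array_shift($queue);
        $visited[] = $url;
        foreach ($fetchLinks($url) as $link) {
            if (!isset($seen[$link])) {
                $seen[$link] = true;
                $queue[] = $link;
            }
        }
    }
    return $visited;
}

// Stand-in for real fetching: a tiny link graph keyed by URL (made-up example hosts).
$graph = [
    'http://a.example/' => ['http://b.example/', 'http://c.example/'],
    'http://b.example/' => ['http://a.example/', 'http://d.example/'],
    'http://c.example/' => [],
    'http://d.example/' => ['http://c.example/'],
];
$order = crawl('http://a.example/', function ($url) use ($graph) {
    return $graph[$url] ?? [];
});
print_r($order);
```

In a real robot the callback would download the page and extract its links; the `$seen` set and the `$maxPages` cap are what keep the crawl from running forever.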
https://webtrh.cz/diskuse/prehladavaci-robot/#reply273456
milwakee wrote (253547, the opening post above):
For HTML there is the DOM (Zend_Dom, for example...)
For dates, regular expressions, for instance...
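To make both suggestions concrete, here is a rough sketch. It uses PHP's built-in `DOMDocument` and `DOMXPath` rather than Zend_Dom, and the helper names `bodyText()` and `findDates()` are made up for illustration. The list of tags treated as clutter and the two date formats (`30. 3. 2009` and `2009-03-30`) are assumptions; real pages will need more of both.

```php
<?php
// Visible text of <body> with script/style/nav/header/footer subtrees removed.
// bodyText() is a hypothetical helper name.
function bodyText(string $html): string
{
    if ($html === '') {
        return '';
    }
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);  // real-world HTML is rarely valid; collect warnings quietly
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Copy the matches into an array first, then remove them from the tree.
    $junk = iterator_to_array($xpath->query('//script | //style | //nav | //header | //footer'));
    foreach ($junk as $node) {
        $node->parentNode->removeChild($node);
    }
    $body = $doc->getElementsByTagName('body')->item(0);
    $text = $body ? $body->textContent : '';
    return trim(preg_replace('/\s+/', ' ', $text));
}

// Finds dates written as "30. 3. 2009" / "30.3.2009" or ISO "2009-03-30"
// and returns the valid ones normalised to YYYY-MM-DD. findDates() is a hypothetical name.
function findDates(string $text): array
{
    $pattern = '/\b(\d{1,2})\.\s?(\d{1,2})\.\s?(\d{4})\b|\b(\d{4})-(\d{2})-(\d{2})\b/';
    preg_match_all($pattern, $text, $matches, PREG_SET_ORDER);
    $dates = [];
    foreach ($matches as $m) {
        if (!empty($m[4])) {  // ISO branch matched
            [$y, $mo, $d] = [(int) $m[4], (int) $m[5], (int) $m[6]];
        } else {              // day. month. year branch matched
            [$d, $mo, $y] = [(int) $m[1], (int) $m[2], (int) $m[3]];
        }
        if (checkdate($mo, $d, $y)) {  // drops impossible dates such as 31. 2.
            $dates[] = sprintf('%04d-%02d-%02d', $y, $mo, $d);
        }
    }
    return $dates;
}

$page = '<html><head><title>T</title><style>p{}</style></head><body>'
      . '<nav><a href="/">Home</a></nav><p>Article   text.</p>'
      . '<script>var x=1;</script></body></html>';
$text = bodyText($page);
$dates = findDates('Posted 30. 3. 2009, updated 2009-04-01, bogus 31. 2. 2009');
```

The regular expression only finds candidates; `checkdate()` is what rejects strings that look like dates but are not.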
30. 3. 2009 22:41:17
https://webtrh.cz/diskuse/prehladavaci-robot/#reply273455
Fautzi
31. 3. 2009 12:05:18
Or the openkapow robot maker.
https://webtrh.cz/diskuse/prehladavaci-robot/#reply273454
Michal Šatal
31. 3. 2009 14:25:26
Unless you are willing to put a really large amount of time and resources into this, you will run into plenty of problems and complications. Essentially you want to do heuristic analysis of text, and you will have to account for the fact that not everything on a page is on-topic or placed in the correct tags (if we are talking about HTML), and so on. Basically you want to build a Googlebot. :) I don't think it can be done simply, because there really is a lot of clutter in web pages.
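To give an idea of what such a heuristic can look like, here is a deliberately naive sketch: score each paragraph by how much of its text is not link text, add that score to the paragraph's parent element, and take the best-scoring parent as the main content. The function name `mainContent()` and the 0.5 link-density cutoff are assumptions made for illustration, and a heuristic this simple will misjudge many real pages.

```php
<?php
// Guess the main content block by scoring parents of text-heavy <p> elements.
// mainContent() is a hypothetical name; the 0.5 cutoff is an arbitrary assumption.
function mainContent(string $html): string
{
    if ($html === '') {
        return '';
    }
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);  // tolerate invalid markup
    $doc->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($doc);

    $scores = [];  // node path => accumulated non-link characters
    $nodes = [];   // node path => parent element
    foreach ($xpath->query('//p') as $p) {
        $total = strlen(trim($p->textContent));
        if ($total === 0) {
            continue;
        }
        $linkChars = 0;
        foreach ($xpath->query('.//a', $p) as $a) {
            $linkChars += strlen(trim($a->textContent));
        }
        if ($linkChars / $total > 0.5) {
            continue;  // mostly link text: probably navigation
        }
        $key = $p->parentNode->getNodePath();
        $nodes[$key] = $p->parentNode;
        $scores[$key] = ($scores[$key] ?? 0) + ($total - $linkChars);
    }
    if (!$scores) {
        return '';
    }
    arsort($scores);  // highest score first
    $winner = $nodes[array_key_first($scores)];
    return trim(preg_replace('/\s+/', ' ', $winner->textContent));
}

$page = '<html><body>'
      . '<div id="nav"><p><a href="/a">Home</a> <a href="/b">About us</a></p></div>'
      . '<div id="content"><p>The crawler starts from one page and follows the links it finds.</p>'
      . '<p>See <a href="/more">more</a> details about indexing here.</p></div>'
      . '<div id="footer"><p>Contact</p></div>'
      . '</body></html>';
$result = mainContent($page);
```

Pages that keep their text in tables or bare `<div>`s without `<p>` defeat this immediately, which is exactly the kind of complication described above.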
https://webtrh.cz/diskuse/prehladavaci-robot/#reply273453
Honzaa
31. 3. 2009 17:46:57
milwakee wrote (253547, the opening post above):
Something like this?
<?php
/*
Copyright (C) 2005 Sam J. Clarke
All rights reserved.
+-------------------------------------------+
| |
| PHP Robot Class |
| |
+-------------------------------------------+
| |
| Author Name: Sam J. Clarke |
| Author Email: admin@free-php.org.uk |
| Author URI: http://www.free-php.org.uk/ |
| Description: This script is a robot class |
| to help you build web robots. |
| |
+-------------------------------------------+
| |
| If you like this, Please link back to us. |
| |
+-------------------------------------------+
LICENSE
This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License (GPL)
as published by the Free Software Foundation; either version 2
of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
To read the license please visit http://www.gnu.org/copyleft/gpl.html
*/
class Robot {
    public $Agent = '-'; // user agent
    public $temp_everything = false; // stores what's sent back

    // gets the status returned, e.g. "200 OK"
    // returns false on fail and the status on success
    // you must call the get everything function first
    function GetStatus()
    {
        $html = $this->temp_everything; // gets what was sent back
        if (!$html) // check it's not false
        {
            return false; // if it is return false
        }
        $pieces = preg_split('/\r\n\r\n|\r\r|\n\n/', $html, 2); // split the HTML from the headers
        $headers = preg_split('/\r\n|\n|\r/', $pieces[0]); // save the headers
        unset($pieces); // unset everything else
        $status = false; // stays false if no status line is found
        for ($i = 0; isset($headers[$i]); $i++)
        {
            // search for the status line, e.g. "HTTP/1.0 200 OK"
            if (preg_match('/^HTTP\//i', $headers[$i]))
            {
                // strip the protocol prefix, leaving the status code and reason
                $status = preg_replace('/^HTTP\/\d\.\d\s+/i', '', $headers[$i]);
            }
        }
        return $status; // return the status
    }

    // gets everything from the url and stores it in a string
    // returns false on fail true on success
    // plain HTTP on port 80 only
    function GetEverything($url)
    {
        $info = @parse_url($url); // parse the url
        if (empty($info['host'])) // no host means nothing to connect to
        {
            return false;
        }
        $fp = @fsockopen($info['host'], 80, $errno, $errstr, 10); // open a socket
        if (!$fp) // check it worked
        {
            return false; // if it didn't return false
        }
        else
        {
            if (empty($info['path'])) // if the path is empty
            {
                $info['path'] = '/'; // then set the path to /
            }
            if (isset($info['query'])) // check if there is a query string
            {
                $query = '?' . $info['query']; // if there is get it ready to use
            }
            else
            {
                $query = ''; // if not make an empty string
            }
            // HTTP headers to send
            $out = "GET " . $info['path'] . $query . " HTTP/1.0\r\n";
            $out .= "Host: " . $info['host'] . "\r\n";
            $out .= "Connection: close\r\n";
            $out .= "User-Agent: " . $this->Agent . "\r\n\r\n";
            fwrite($fp, $out); // write the HTTP headers to the socket
            $html = ''; // make an empty string to store the response in
            while (!feof($fp)) // while not end of socket
            {
                $html .= fread($fp, 8192); // read from the socket and add it to the string
            }
            fclose($fp); // close the socket
            $this->temp_everything = $html; // save the string
            return true; // return true
        }
    }

    // returns everything that was read from the socket or false
    // you must call the get everything function first
    function ReturnEverything()
    {
        $html = $this->temp_everything; // gets what was read from the socket
        return $html;
    }

    // gets an array of urls from the web page
    // returns an array of urls or false on fail
    // you must call the get everything function first
    function GetUrls($url)
    {
        $info = @parse_url($url); // parse the url
        $html = $this->temp_everything; // gets what was sent back
        if (!$html) // check it's not false
        {
            return false; // if it is return false
        }
        $pieces = preg_split('/\r\n\r\n|\r\r|\n\n/', $html, 2); // split the HTML from the headers
        $html = isset($pieces[1]) ? $pieces[1] : ''; // save the HTML
        unset($pieces); // unset everything else
        // find all the urls
        preg_match_all("|href=\"?'?`?([[:alnum:]:?=&@/;._-]+)\"?'?`?|i", $html, $matches);
        $links = array(); // make an array to store them in
        $ret = $matches[1];
        for ($i = 0; isset($ret[$i]); $i++)
        {
            // if it starts with http:// or https:// save it without editing
            if (preg_match("|^https?://(.*)|i", $ret[$i]))
            {
                $links[] = $ret[$i];
            }
            // if it matches mailto: (checked before the catch-all below, which would otherwise swallow it)
            elseif (preg_match("/^mailto:(.*)/i", $ret[$i]))
            {
                // could save email addresses here
            }
            // if it matches /place.html
            elseif (preg_match("|^/(.*)|i", $ret[$i]))
            {
                // add it to the host name and save it
                $links[] = 'http://' . $info['host'] . $ret[$i];
            }
            // anything else is treated as relative to the site root
            else
            {
                // add it to the host name and save it
                $links[] = 'http://' . $info['host'] . '/' . $ret[$i];
            }
        }
        return $links; // return the array of links
    }

    // gets the headers returned
    // returns false on fail headers on success
    // you must call the get everything function first
    function GetHeaders()
    {
        $html = $this->temp_everything; // gets what was sent back
        if (!$html) // check it's not false
        {
            return false; // if it is return false
        }
        $pieces = preg_split('/\r\n\r\n|\r\r|\n\n/', $html, 2); // split the HTML from the headers
        return $pieces[0]; // return the headers
    }

    // gets the html of a page
    // returns false on fail HTML on success
    // you must call the get everything function first
    function GetHTML()
    {
        $html = $this->temp_everything; // gets what was sent back
        if (!$html) // check it's not false
        {
            return false; // if it is return false
        }
        $pieces = preg_split('/\r\n\r\n|\r\r|\n\n/', $html, 2); // split the HTML from the headers
        return isset($pieces[1]) ? $pieces[1] : ''; // return the HTML
    }

    // gets the text of a web page
    // returns false on fail text on success
    // you must call the get everything function first
    function GetTEXT()
    {
        $html = $this->temp_everything; // gets what was sent back
        if (!$html) // check it's not false
        {
            return false; // if it is return false
        }
        $pieces = preg_split('/\r\n\r\n|\r\r|\n\n/', $html, 2); // split the HTML from the headers
        $html = isset($pieces[1]) ? $pieces[1] : '';
        // strip the HTML off and just leave text
        $html = preg_replace('@<script[^>]*?>.*?</script>@si', ' ', $html); // drop scripts with their content
        $html = preg_replace('@<style[^>]*?>.*?</style>@si', ' ', $html); // drop stylesheets with their content
        $html = strip_tags($html);
        $html = preg_replace('@&#(\d+);@', ' ', $html); // numeric entities (the old /e modifier is gone in PHP 7)
        // named entities are replaced with a space
        $entities = array('&amp;', '&lt;', '&gt;', '&nbsp;', '&iexcl;', '&cent;', '&pound;', '&copy;');
        $html = str_replace($entities, ' ', $html);
        $html = preg_replace('@(\r+|\n+| +)@s', ' ', $html); // collapse line breaks and runs of spaces
        return $html; // return the text
    }
}
?>
https://webtrh.cz/diskuse/prehladavaci-robot/#reply273452