Web crawling robot

30. 3. 2009 21:43:18

Hi everyone,

I have this problem: I want to build a robot that indexes web pages. It starts with one page, collects the links on it, and gradually indexes further pages.

That part is basically not a problem. What I'm struggling with is how to process the data: how do I tell what is in the BODY, what is important, and what is just the clutter around it (header, navigation, etc.)?

My goal is to build something like an encyclopedia: identify the text and assign it to the relevant topic. Can you give me any pointers on how to do that?

One more question: I want the robot to look for dates. Do you know how to identify them in text?

Thanks for any advice and ideas.


https://webtrh.cz/diskuse/prehladavaci-robot/#reply273456

Martin Talavášek

(44 ratings)

30. 3. 2009 22:41:17

In reply to milwakee (post 253547):

For HTML there is the DOM (Zend_Dom, for example...)
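To make the DOM suggestion concrete, here is a minimal sketch that uses PHP's built-in DOM extension rather than Zend_Dom. The function name extractBodyText and the list of "noise" tags are assumptions for illustration; real pages will need a longer or site-specific list.

```php
<?php
// Sketch: pull readable text out of a page with PHP's built-in DOM extension.
// Assumption: <script>, <style>, <nav>, <header>, <footer>, <aside> and <form>
// hold clutter rather than content. This is a guess, not a standard.

function extractBodyText(string $html): string
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Copy the node list to an array first, then detach each noise element.
    $noise = $xpath->query('//script | //style | //nav | //header | //footer | //aside | //form');
    foreach (iterator_to_array($noise) as $node) {
        if ($node->parentNode !== null) {
            $node->parentNode->removeChild($node);
        }
    }

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    // Join the remaining text nodes with spaces so adjacent elements do not run together.
    $parts = [];
    foreach ($xpath->query('.//text()', $body) as $textNode) {
        $text = trim($textNode->nodeValue);
        if ($text !== '') {
            $parts[] = $text;
        }
    }
    return preg_replace('/\s+/', ' ', implode(' ', $parts));
}

$page = '<html><head><title>t</title></head><body>'
      . '<nav><a href="/">Home</a></nav>'
      . '<div><h1>Prague</h1><p>Prague is the capital of the Czech Republic.</p></div>'
      . '<footer>Contact us</footer></body></html>';

echo extractBodyText($page), "\n";
```

Note that loadHTML() guesses the character encoding unless the page declares one, so non-ASCII pages may need converting first.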

For dates, regular expressions, for instance...
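As a rough illustration of the regular-expression idea, the sketch below looks for dates written the way this forum prints them ("30. 3. 2009") plus ISO dates ("2009-03-30"). The function name findDates and the choice of these two notations are assumptions; month names and two-digit years are not handled.

```php
<?php
// Sketch: find dates written as "30. 3. 2009" / "30.3.2009" or ISO "2009-03-30".
// Candidates are validated with checkdate() so that "31. 2. 2009" is rejected.

function findDates(string $text): array
{
    $pattern = '/\b(?:(\d{1,2})\.\s?(\d{1,2})\.\s?(\d{4})|(\d{4})-(\d{2})-(\d{2}))\b/';
    preg_match_all($pattern, $text, $matches, PREG_SET_ORDER);

    $dates = [];
    foreach ($matches as $m) {
        if (isset($m[4]) && $m[4] !== '') {
            // ISO branch matched: groups 4-6 are year, month, day.
            [$year, $month, $day] = [(int) $m[4], (int) $m[5], (int) $m[6]];
        } else {
            // Dotted branch matched: groups 1-3 are day, month, year.
            [$day, $month, $year] = [(int) $m[1], (int) $m[2], (int) $m[3]];
        }
        if (checkdate($month, $day, $year)) {
            $dates[] = sprintf('%04d-%02d-%02d', $year, $month, $day);
        }
    }
    return $dates;
}

print_r(findDates('Posted 30. 3. 2009, edited 31.3.2009, due 2009-04-01, bogus 31. 2. 2009.'));
```

Normalising every hit to one format (here YYYY-MM-DD) makes the dates easy to sort and compare later in the index.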


https://webtrh.cz/diskuse/prehladavaci-robot/#reply273455

Fautzi

(1 rating)

31. 3. 2009 12:05:18

Or the openkapow robot maker


https://webtrh.cz/diskuse/prehladavaci-robot/#reply273454

Michal Šatal

(12 ratings)

31. 3. 2009 14:25:26

Unless you are willing to put a really large amount of time and resources into it, you will run into a lot of problems and complications. Basically you want to do heuristic text analysis, and you will have to account for the fact that not everything on a page is on-topic or placed in the correct tags (if we are talking about HTML), and so on. Basically you want to build a Google bot. :) I don't think it can be done simply, because there really is a lot of clutter in web pages.
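One simple heuristic of the kind mentioned here is link density: navigation blocks consist mostly of link text, while content blocks are mostly plain text. The sketch below is only an illustration of that idea. The function name contentBlocks, the choice of block tags and both thresholds are assumptions that would need tuning on real pages.

```php
<?php
// Sketch of a link-density heuristic. A block is kept when it has enough text
// and only a small share of that text sits inside <a> elements.
// Thresholds are guesses; nested blocks (a <p> inside an <li>) would be counted twice.

function contentBlocks(string $html, int $minChars = 40, float $maxLinkDensity = 0.3): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $kept = [];
    foreach ($xpath->query('//p | //li | //td') as $block) {
        $text = trim(preg_replace('/\s+/', ' ', $block->textContent));
        $total = strlen($text); // byte length; use mb_strlen() for multibyte text
        if ($total < $minChars) {
            continue; // too short to be body text
        }
        $linkChars = 0;
        foreach ($xpath->query('.//a', $block) as $anchor) {
            $linkChars += strlen(trim($anchor->textContent));
        }
        if ($linkChars / $total <= $maxLinkDensity) {
            $kept[] = $text;
        }
    }
    return $kept;
}

$sample = '<html><body>'
        . '<ul><li><a href="/a">Home page of the site</a> <a href="/b">Archive of all older posts</a></li></ul>'
        . '<p>The crawler should keep this paragraph because it is long and has <a href="/x">one link</a>.</p>'
        . '<p>Short.</p></body></html>';

print_r(contentBlocks($sample));
```

This will still misjudge plenty of pages, which is the point made above: expect to combine several such signals and tune them per site.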


https://webtrh.cz/diskuse/prehladavaci-robot/#reply273453

Honzaa

31. 3. 2009 17:46:57

In reply to milwakee (post 253547):

Something like this?

<?php
/*
+-------------------------------------------+
|                                           |
|              PHP Robot Class              |
|                                           |
+-------------------------------------------+
|                                           |
| Author Name: Sam J. Clarke                |
| Author Email: admin@free-php.org.uk       |
| Author URI: http://www.free-php.org.uk/   |
| Description: This script is a robot class |
| to help you build web robots.             |
|                                           |
+-------------------------------------------+
|                                           |
| If you like this, Please link back to us. |
|                                           |
+-------------------------------------------+

LICENSE

This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License (GPL)
as published by the Free Software Foundation; either version 2
of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.

To read the license please visit http://www.gnu.org/copyleft/gpl.html
*/

class Robot {

    var $Agent = '-'; // user agent sent with every request
    var $temp_everything = false; // raw response stored by GetEverything()

    // Gets the status code returned.
    // Returns false on failure and the status (e.g. "200 OK") on success.
    // You must call GetEverything() first.
    function GetStatus()
    {
        $html = $this->temp_everything; // raw response (headers + body)
        if (!$html) // nothing fetched yet, or the fetch failed
        {
            return false;
        }
        $pieces = preg_split("/(\r\n\r\n|\r\r|\n\n)/", $html, 2); // split the headers from the HTML
        $headers = preg_split("/(\r\n|\n|\r)/", $pieces[0]); // one header line per element
        unset($pieces); // unset everything else
        $status = false; // stays false if no status line is found
        for ($i = 0; isset($headers[$i]); $i++)
        {
            // search for the status line, e.g. "HTTP/1.0 200 OK"
            if (preg_match("/^HTTP\//i", $headers[$i]))
            {
                // remove the "HTTP/x.y " prefix and keep the status code
                $status = preg_replace("/^http\/[0-9]\.[0-9] /i", '', $headers[$i]);
                break;
            }
        }
        return $status; // return the status code
    }

    // Gets everything from the URL and stores it in a string.
    // Returns false on failure, true on success.
    function GetEverything($url)
    {
        $info = @parse_url($url); // parse the url
        if (!$info || empty($info['host'])) // no usable host name
        {
            return false;
        }
        $fp = @fsockopen($info['host'], 80, $errno, $errstr, 10); // open a socket
        if (!$fp) // check it worked
        {
            return false; // if it didn't return false
        }
        else
        {
            if (empty($info['path'])) // if the path is empty
            {
                $info['path'] = '/'; // then set the path to /
            }
            if (isset($info['query'])) // check if there is a query string
            {
                $query = '?' . $info['query']; // if there is get it ready to use
            }
            else
            {
                $query = ''; // if not make an empty string
            }
            // HTTP headers to send
            $out = "GET " . $info['path'] . $query . " HTTP/1.0\r\n";
            $out .= "Host: " . $info['host'] . "\r\n";
            $out .= "Connection: close\r\n";
            $out .= "User-Agent: " . $this->Agent . "\r\n\r\n";
            fwrite($fp, $out); // write the HTTP headers to the socket
            $html = ''; // make an empty string to store the response in
            while (!feof($fp)) // while not end of socket
            {
                $html .= fread($fp, 8192); // read from the socket and add it to the string
            }
            fclose($fp); // close the socket
            $this->temp_everything = $html; // save the string
            return true; // return true
        }
    }

    // Returns everything that was read from the socket, or false.
    // You must call GetEverything() first.
    function ReturnEverything()
    {
        $html = $this->temp_everything; // gets what was read from the socket
        return $html;
    }

    // Gets an array of URLs from the web page.
    // Returns an array of URLs, or false on failure.
    // You must call GetEverything() first.
    function GetUrls($url)
    {
        $info = @parse_url($url); // parse the url
        $html = $this->temp_everything; // raw response (headers + body)
        if (!$html) // check it's not false
        {
            return false; // if it is return false
        }
        $pieces = preg_split("/(\r\n\r\n|\r\r|\n\n)/", $html, 2); // split the HTML from the headers
        $html = isset($pieces[1]) ? $pieces[1] : ''; // save the HTML
        unset($pieces); // unset everything else
        // find all the urls
        preg_match_all("|href=\"?'?`?([[:alnum:]:?=&@/;._-]+)\"?'?`?|i", $html, $matches);
        $links = array(); // make an array to store them in
        $ret = $matches[1]; // the captured href values
        for ($i = 0; isset($ret[$i]); $i++)
        {
            // if it starts with http:// save it without editing
            if (preg_match("|^http://(.*)|i", $ret[$i]))
            {
                $links[] = $ret[$i];
            }
            // if it matches mailto: (checked before the catch-all branch below)
            elseif (preg_match("/^mailto:(.*)/i", $ret[$i]))
            {
                // could save email addresses here
            }
            // if it matches /place.html
            elseif (preg_match("|^/(.*)|i", $ret[$i]))
            {
                // add it to the host name and save it
                $links[] = 'http://' . $info['host'] . $ret[$i];
            }
            // anything else is treated as relative to the site root
            else
            {
                // add it to the host name and save it
                $links[] = 'http://' . $info['host'] . '/' . $ret[$i];
            }
        }
        return $links; // return the array of links
    }

    // Gets the headers returned.
    // Returns false on failure, the headers on success.
    // You must call GetEverything() first.
    function GetHeaders()
    {
        $html = $this->temp_everything; // raw response (headers + body)
        if (!$html) // check it's not false
        {
            return false; // if it is return false
        }
        $pieces = preg_split("/(\r\n\r\n|\r\r|\n\n)/", $html, 2); // split the HTML from the headers
        return $pieces[0]; // return the headers
    }

    // Gets the HTML of a page.
    // Returns false on failure, the HTML on success.
    // You must call GetEverything() first.
    function GetHTML()
    {
        $html = $this->temp_everything; // raw response (headers + body)
        if (!$html) // check it's not false
        {
            return false; // if it is return false
        }
        $pieces = preg_split("/(\r\n\r\n|\r\r|\n\n)/", $html, 2); // split the HTML from the headers
        return isset($pieces[1]) ? $pieces[1] : ''; // return the HTML
    }

    // Gets the text of a web page.
    // Returns false on failure, the text on success.
    // You must call GetEverything() first.
    function GetTEXT()
    {
        $html = $this->temp_everything; // raw response (headers + body)
        if (!$html) // check it's not false
        {
            return false; // if it is return false
        }
        $pieces = preg_split("/(\r\n\r\n|\r\r|\n\n)/", $html, 2); // split the HTML from the headers
        $body = isset($pieces[1]) ? $pieces[1] : '';
        // strip the HTML off and just leave text
        $html = preg_replace('@<script[^>]*?>.*?</script>@si', ' ', $body); // drop scripts
        $html = preg_replace('@<style[^>]*?>.*?</style>@si', ' ', $html); // drop stylesheets
        $html = strip_tags($html);
        $html = preg_replace('@&#(\d+);@', ' ', $html); // numeric entities
        $html = str_replace('&amp;', ' ', $html);
        $html = str_replace('&lt;', ' ', $html);
        $html = str_replace('&gt;', ' ', $html);
        $html = str_replace('&nbsp;', ' ', $html);
        $html = str_replace('&iexcl;', ' ', $html);
        $html = str_replace('&cent;', ' ', $html);
        $html = str_replace('&pound;', ' ', $html);
        $html = preg_replace('@(\r+|\n+| +)@s', ' ', $html); // collapse line breaks and runs of spaces
        return $html; // return the text
    }
}
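For the crawling part of the original question (start with one page, take its links, move on to the next pages), the loop could look roughly like the sketch below. It is independent of the class above. The names resolveUrl and crawl, the $fetch callback and the example.test URLs are all made up for illustration, and the resolver only covers the simple cases (it does not normalise "../").

```php
<?php
// Sketch: breadth-first crawl with a visited set and a page limit.
// $fetch is any callable returning the HTML for a URL, or null on failure,
// so the loop can be tried offline before it is wired to real HTTP requests.

function resolveUrl(string $base, string $href): ?string
{
    if (preg_match('~^https?://~i', $href)) {
        return $href; // already absolute
    }
    if ($href === '' || preg_match('~^(mailto:|javascript:|#)~i', $href)) {
        return null; // nothing to crawl
    }
    $parts = parse_url($base);
    if (empty($parts['host'])) {
        return null;
    }
    $root = ($parts['scheme'] ?? 'http') . '://' . $parts['host'];
    if ($href[0] === '/') {
        return $root . $href; // host-relative link
    }
    $path = $parts['path'] ?? '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1); // directory of the base page
    return $root . $dir . $href; // document-relative link
}

function crawl(string $start, callable $fetch, int $maxPages = 50): array
{
    $queue = [$start];
    $seen = [$start => true]; // URLs already queued, to avoid loops
    $visited = [];
    while ($queue && count($visited) < $maxPages) {
        $url = array_shift($queue);
        $html = $fetch($url);
        if ($html === null) {
            continue; // fetch failed, skip this URL
        }
        $visited[] = $url;
        preg_match_all('/<a\s[^>]*href=["\']([^"\']+)["\']/i', $html, $m);
        foreach ($m[1] as $href) {
            $next = resolveUrl($url, $href);
            if ($next !== null && !isset($seen[$next])) {
                $seen[$next] = true;
                $queue[] = $next;
            }
        }
    }
    return $visited;
}

// A tiny in-memory "site" stands in for the network.
$site = [
    'http://example.test/'       => '<a href="/a.html">A</a> <a href="b.html">B</a> <a href="mailto:x@example.test">m</a>',
    'http://example.test/a.html' => '<a href="/">home</a> <a href="http://example.test/c.html">C</a>',
    'http://example.test/b.html' => '',
];
$fetch = function (string $url) use ($site) {
    return $site[$url] ?? null;
};

print_r(crawl('http://example.test/', $fetch));
```

Keeping the fetch step behind a callback also makes it easy to add politeness rules later (robots.txt checks, delays between requests) without touching the queue logic.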


https://webtrh.cz/diskuse/prehladavaci-robot/#reply273452
