GitHub - mvdbos/php-spider: A configurable and extensible PHP web spider

PHP-Spider Features

  • supports two traversal algorithms: breadth-first and depth-first
  • supports crawl depth limiting, queue size limiting and max downloads limiting
  • supports adding custom URI discovery logic, based on XPath, CSS selectors, or plain old PHP
  • comes with a useful set of URI filters, such as robots.txt and Domain limiting
  • supports custom URI filters, both prefetch (URI) and postfetch (Resource content)
  • supports custom request handling logic
  • supports Basic, Digest and NTLM HTTP authentication. See example.
  • comes with a useful set of persistence handlers (memory, file)
  • supports custom persistence handlers
  • collects statistics about the crawl for reporting
  • dispatches useful events, allowing developers to add even more custom behavior
  • supports a politeness policy
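
The two traversal orders named in the feature list can be illustrated without the library. The following is a minimal sketch in plain PHP, not PHP-Spider's own queue code; the `traverse` function and the `$links` graph are hypothetical names made up for this illustration. Breadth-first takes the oldest discovered URI first, depth-first takes the newest.

```php
<?php
declare(strict_types=1);

/**
 * Visit every page reachable from $start in a hypothetical link graph.
 * $order selects how the frontier is consumed:
 * 'breadth' takes the oldest entry (FIFO), anything else takes the newest (LIFO).
 */
function traverse(array $links, string $start, string $order): array
{
    $frontier = [$start];
    $seen = [$start => true];
    $visited = [];

    while ($frontier !== []) {
        $page = $order === 'breadth' ? array_shift($frontier) : array_pop($frontier);
        $visited[] = $page;

        foreach ($links[$page] ?? [] as $next) {
            if (!isset($seen[$next])) {
                $seen[$next] = true; // never enqueue the same URI twice
                $frontier[] = $next;
            }
        }
    }

    return $visited;
}

// Made-up site structure: page => links found on that page.
$links = [
    '/'  => ['/a', '/b'],
    '/a' => ['/a1'],
    '/b' => ['/b1'],
];

echo implode(' ', traverse($links, '/', 'breadth')), "\n"; // / /a /b /a1 /b1
echo implode(' ', traverse($links, '/', 'depth')), "\n";   // / /b /b1 /a /a1
```

Breadth-first reaches every page at one depth before going deeper, which pairs naturally with a depth limit; depth-first follows one branch to its end before backtracking.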

This spider does not support JavaScript.

Installation

The easiest way to install PHP-Spider is with Composer. Find it on Packagist.

$ composer require vdb/php-spider

Usage

This is a very simple example; the code can be found in example/example_simple.php. For a more complete, real-world example with logging, caching and filters, see example/example_complex.php.

Note that by default, the spider stops processing when it encounters a 4XX or 5XX error response. To set the spider up to keep processing, see the link checker example. It uses a custom request handler that configures the default Guzzle request handler not to fail on 4XX and 5XX responses.

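First, create the spider. The classes used in this walkthrough are imported from the VDB\Spider namespace:

// Assumes Composer's autoloader has already been loaded (for example via vendor/autoload.php).
use VDB\Spider\Discoverer\XPathExpressionDiscoverer;
use VDB\Spider\Spider;
use VDB\Spider\StatsHandler;

$spider = new Spider('http://www.dmoz.org');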

Add a URI discoverer. Without it, the spider does nothing. In this case, we want all <a> nodes from a certain <div>:

$spider->getDiscovererSet()->set(new XPathExpressionDiscoverer("//div[@id='catalogs']//a"));
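
To see which links an expression like this selects, the same XPath can be tried against a small page using PHP's DOM extension. This is a sketch under stated assumptions, not the discoverer's actual implementation: `discoverLinks` and the sample HTML are made up for illustration, and only the XPath expression comes from the example above.

```php
<?php
declare(strict_types=1);

/**
 * Return the href of every <a> element matched by $expression in $html.
 * Uses PHP's DOM extension directly; this is not PHP-Spider's own discoverer.
 */
function discoverLinks(string $html, string $expression): array
{
    $document = new DOMDocument();
    // Real-world HTML is rarely valid; keep libxml parser warnings out of the output.
    $previous = libxml_use_internal_errors(true);
    $document->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $nodes = (new DOMXPath($document))->query($expression);
    if ($nodes === false) {
        return []; // the expression could not be evaluated
    }

    $hrefs = [];
    foreach ($nodes as $anchor) {
        if ($anchor instanceof DOMElement && $anchor->hasAttribute('href')) {
            $hrefs[] = $anchor->getAttribute('href');
        }
    }

    return $hrefs;
}

// Made-up sample page: only the links inside div#catalogs should match.
$html = <<<'HTML'
<html><body>
  <div id="nav"><a href="/about">About</a></div>
  <div id="catalogs">
    <a href="/arts">Arts</a>
    <ul><li><a href="/science">Science</a></li></ul>
  </div>
</body></html>
HTML;

print_r(discoverLinks($html, "//div[@id='catalogs']//a")); // /arts and /science
```

The double slash before `a` matches anchors at any nesting level inside the `<div>`, so the link inside the list is found while the navigation link is not.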

Set some sane options for this example. In this case, we only get the first 10 items from the start page.

$spider->getDiscovererSet()->maxDepth = 1;
$spider->getQueueManager()->maxQueueSize = 10;
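
How a depth limit and a queue size limit interact can be sketched in plain PHP. This illustrates the idea only and is not the library's implementation: `boundedCrawl` and the `$links` graph are hypothetical, and the sketch assumes the start page sits at depth 0 and is not counted against the queue limit, which may differ from PHP-Spider's own accounting.

```php
<?php
declare(strict_types=1);

/**
 * Breadth-first walk over a hypothetical link graph with two caps:
 *  - links are only followed from pages found at a depth below $maxDepth;
 *  - at most $maxQueueSize URIs are ever enqueued (start page not counted here).
 */
function boundedCrawl(array $links, string $start, int $maxDepth, int $maxQueueSize): array
{
    $queue = [[$start, 0]];
    $seen = [$start => true];
    $enqueued = 0;
    $visited = [];

    while ($queue !== []) {
        [$page, $depth] = array_shift($queue);
        $visited[] = $page;

        if ($depth >= $maxDepth) {
            continue; // depth cap reached: do not discover links on this page
        }

        foreach ($links[$page] ?? [] as $next) {
            if ($enqueued >= $maxQueueSize) {
                break; // queue cap reached: ignore the remaining links
            }
            if (!isset($seen[$next])) {
                $seen[$next] = true;
                $queue[] = [$next, $depth + 1];
                $enqueued++;
            }
        }
    }

    return $visited;
}

// Made-up site structure: page => links found on that page.
$links = [
    '/'  => ['/a', '/b', '/c'],
    '/a' => ['/a1'],
];

print_r(boundedCrawl($links, '/', 1, 2));  // '/', '/a', '/b'
print_r(boundedCrawl($links, '/', 2, 10)); // '/', '/a', '/b', '/c', '/a1'
```

With a depth limit of 1, only links found on the start page are fetched; the queue limit then caps how many of those links are taken.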

Add a listener to collect stats from the Spider and the QueueManager. Other components also dispatch events that you can subscribe to.

$statsHandler = new StatsHandler();
$spider->getQueueManager()->getDispatcher()->addSubscriber($statsHandler);
$spider->getDispatcher()->addSubscriber($statsHandler);

Execute the crawl:

$spider->crawl();

When the crawl is done, we can report some statistics about it:

echo "\n  ENQUEUED:  " . count($statsHandler->getQueued());
echo "\n  SKIPPED:   " . count($statsHandler->getFiltered());
echo "\n  FAILED:    " . count($statsHandler->getFailed());
echo "\n  PERSISTED: " . count($statsHandler->getPersisted());

Finally, we can do some processing on the downloaded resources. In this example, we echo the title of each resource:

echo "\n\nDOWNLOADED RESOURCES: ";
foreach ($spider->getDownloader()->getPersistenceHandler() as $resource) {
    echo "\n - " . $resource->getCrawler()->filterXPath('//title')->text();
}

Contributing

Contributing to PHP-Spider is as easy as forking the repository on GitHub and submitting a Pull Request. The Symfony documentation contains an excellent guide on how to do that properly: Submitting a Patch.

There are a few requirements for a Pull Request to be accepted:

  • Follow the coding standards: PHP-Spider follows the coding standards defined in the PSR-0, PSR-1 and PSR-2 Coding Style Guides.
  • Prove that the code works with unit tests and that coverage remains 100%.

Note: An easy way to check if your code conforms to PHP-Spider is by running the script bin/static-analysis, which is part of this repo. This will run the following tools, configured for PHP-Spider: PHP CodeSniffer, PHP Mess Detector and PHP Copy/Paste Detector.

Note: To run PHPUnit with coverage, and to check that coverage == 100%, you can run bin/coverage-enforce.

Support

For things like reporting bugs and requesting features it is best to create an issue here on GitHub. It is even better to accompany it with a Pull Request. ;-)

License

PHP-Spider is licensed under the MIT license.
