Phasik Flux

New Project: Spyder

by Administrator on Apr.01, 2011, under Projects, Software

I decided to write a simple web spider in order to learn Python, and to generate a list of urls for webserver benchmarking & stress testing… and so Spyder was born.

Github:

https://github.com/trinitronx/Spyder

Spyder:

a simple web spider written in Python

When called on a url, it will spider the pages and any links found up to the depth specified.
After it's done, it will print a list of resources that it found.
Currently, the resources it tries to find are:

images   -  any images found on the page (ie: <img src="THIS"/>)
styles   -  any external stylesheets found on the page.  CSS included via '@import' is currently only supported if within a style tag!
(ie: <link rel="stylesheet" src="THIS"/>  OR <style>@import url('THIS');</style> )
scripts  -  any external scripts found in the page (ie: <script src="THIS"> )
links    -  any urls found on the page.  'Fragments' are discarded. (ie: <a href="THIS#this-is-a-fragment"> )
emails   -  any email addresses found on the page (ie: <a href="mailto:THIS"> )

An example script for doing something like this, 'www-benchmark.py', is included. It uses apache benchmark as an example.
Eventually I'll be experimenting with 'siege' for benchmarking & server stress-testing.

NOTE: Currently the spider can throw exceptions in certain cases (mainly character encoding stuff, but there are probably other bugs too)
Getting *working* character encoding detection is a goal, and is sorta-working... ish? Help in this area would be appreciated!
Filtering the results by domain is almost working too


Leave a Reply

You must be logged in to post a comment.

Get Adobe Flash playerPlugin by wpburn.com wordpress themes

Looking for something?

Use the form below to search the site:

Still not finding what you're looking for? Drop a comment on a post or contact us so we can take care of it!

Visit our friends!

A few highly recommended friends...