PunkSCAN 1.2.x Deployment Guide

Official Guide


What PunkSCAN is

PunkSCAN is a ridiculously stable and fast distributed mass web application scanner. It is intended to deeply scan a massive number of targets, looking for common vulnerabilities. We use it to scan the entire Internet through our PunkSPIDER project where we are using PunkSCAN to scan several million URLs for security issues. It can also be set up to continuously scan said targets and handle errors gracefully without compromising the rest of the job. It works by leveraging an existing Hadoop cluster to greatly improve scan time, depth, performance, and stability.

It is meant to scan a massive number of targets in a short amount of time. It is not the most thorough scanner in terms of the vulnerabilities that it checks for (yet), but if you're looking to get quick information on the basic security measures of a massive number of sites, then PunkSCAN is for you.

What PunkSCAN is not

PunkSCAN is not a "simple" web application scanner for scanning a few sites or even just one or two domains. If you're looking to scan a single target or just a couple of targets, I recommend looking elsewhere, as the PunkSCAN setup is meant to scale massively, and would be a bit heavy handed for just a few sites.


PunkSCAN works by using a Hadoop cluster and interacting with a series of URLs indexed to Solr for targets. It is assumed, for this release, that you have an active Hadoop cluster that you can use. We recommend using at least Hadoop 1.0.0 when running PunkSCAN, as this is the release version that we use and have done extensive testing with. Solr testing occurred with Solr v3.6 and we recommend using at least this version of Solr. For convenience we have provided two preconfigured Solr instances that you can use.

For targeting and reporting, we are using two Solr cores, each with its own unique schema. One is the Solr Summary schema and the other is the Solr Details schema. Preconfigured Solr instances are provided to you with a release package, so you don't have to worry about downloading and configuring Solr. What we call the "Solr Summary" instance provides summary information on a target hostname. Solr Summary also initially holds targets that you load into the system, and these target Solr records will be updated as scanning occurs against them. The Solr Details schema initially holds nothing, and records will be added with details on the vulnerabilities found on a particular web application located at your target hostnames.

Any questions about this guide or PunkSCAN? Join the mailing list by sending an email to punkspider@hyperiongray.com requesting to join. Make sure you put something in there that indicates you're a human (a creative sentence or perhaps a haiku is always nice), or our spam filters will block you.

Before you start

This guide and PunkSCAN have been tested with various Hadoop versions > 1.0, Ubuntu 12.04+, and Oracle Java 7+. The following should work just fine for you Fedora/Red Hat folks, along with any other Debian-based distros, but we are looking for additional reports of this. If you happen to test this on Red Hat/Fedora/Debian, please let us know by emailing punkspider@hyperiongray.com.

Installing and Running PunkSCAN

Even though it might sound like there's a lot going on, we've provided enough for you "out of the box" that you should be able to get up and running within about 20 minutes, maybe less. The following are the steps to get going:

Install Dependencies

(0) If you don't have a Hadoop cluster already, we recommend these two guides by Michael Noll for getting your very own little distributed computing cluster going. Once your cluster is running, you are ready to install and run PunkSCAN. Ensure that you have the following properties set in mapred-site.xml:




PunkSCAN uses the defaults in the Hadoop cluster to determine the number of map and reduce tasks allowed. In general you should set the reduce tasks to 2x the number of machines in your cluster, and mapred.map.tasks to 10x the number of machines in your cluster.
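
As a rough illustration of the sizing rule above, here is a minimal Python sketch that computes the two values from the number of machines in a cluster. The function name is hypothetical, and mapping "the reduce tasks" to the Hadoop 1.x property mapred.reduce.tasks is an assumption; check it against your own mapred-site.xml.

```python
def recommended_task_counts(num_machines):
    """Apply the rule of thumb: 10x machines for map tasks, 2x for reduce tasks."""
    if num_machines < 1:
        raise ValueError("a cluster needs at least one machine")
    return {
        "mapred.map.tasks": 10 * num_machines,
        # Assumption: this is the property behind "the reduce tasks".
        "mapred.reduce.tasks": 2 * num_machines,
    }


if __name__ == "__main__":
    # Example: a four-machine cluster.
    for name, value in sorted(recommended_task_counts(4).items()):
        print("%s = %d" % (name, value))
```

These are starting points only; tune them for your own hardware.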

(1) Install python-paramiko and ant: On the Hadoop master node that you'd like to administer scans from, you're going to need the python-paramiko library and ant. On Debian-based distros these can be installed with:

sudo apt-get install python-paramiko ant

(2) Get the latest punkscan release. This can be downloaded from bitbucket:

git clone https://bitbucket.org/punkspider/punkscan.git

or you can use the release tarball from here and decompress it.

(3) Upgrade to the latest version of lxml on each node in the cluster (this is the only step required on each node in the cluster)

sudo apt-get install libxml2-dev libxslt-dev python-dev lib32z1-dev
sudo pip install lxml

Install PunkSCAN

(1) In the root directory of the project you'll find a file named install.run. Run it:


   This will compile Apache Nutch, the web crawler that we will use to collect URLs to fuzz, install a couple of python libraries for you, and configure Apache Nutch for you. PunkSCAN is now installed.

Configure Nutch and PunkSCAN

(1) Configure a proxy:  We strongly recommend using a proxy with PunkSCAN.  In order to do so, open up nutch/conf/nutch-site.xml and edit the fields that look like this:

  <description>The proxy hostname.  If empty, no proxy is used.</description>

  <description>The proxy port.</description>

Set these fields to the hostname and port of the proxy that you'd like to use for crawling. Once you're finished, run the nutch_reconfig bash script located in nutch/runtime/deploy. From the nutch/conf directory:

cd nutch/runtime/deploy

The previous steps configure a proxy for site crawling, but you can configure a separate proxy for fuzzing. Note that we only support HTTP proxies. In order to configure one open up punkscan/punk_fuzzer/fuzzer_config/punk_fuzz.cfg.xml and change the line that looks like this:


Set it to the proxy that you'd like to use for fuzzing.

(2) Final configurations

Now you're ready to configure PunkSCAN. Pop open punkscan/punkscan_configs/punkscan_config.cfg. Below is an example config with explanations of each field:


HADOOP_HOME = /usr/local/hadoop #Set this to your HADOOP_HOME, this should be the same as the output of echo $HADOOP_HOME from a shell
NUTCH_HOME =  /usr/local/punkscan/nutch #Set this to your NUTCH_HOME, this should be the full path of the nutch directory in the 
										#root directory of your punkSCAN download.


sim_urls_to_scan = 120 #set this to the number of simultaneous targets that you'd like to be scanned per each round of scanning.
				       #A reasonable default is ~8-10 sites per machine in your cluster 

depth = 3 #The depth to crawl to, a reasonable default for a quick scan is two, for a deeper scan is three or four.

topN = all #The number of links to collect on each "depth" level. Set this to all to collect everything at each depth. A reasonable
           #default if you want "quick" scans is about 30-50 per each sim_urls_to_scan, for fairly quick scans is about 100-120 per
           #each sim_urls_to_scan.


#note these URLs must be accessible by every machine in your Hadoop cluster!

solr_details_url = http://<Solr Details ip>:8984/solr #Set this to your solr details URL. This is the default if you are using our preconfigured Solr instances.
solr_summary_url = http://<Solr Summary ip>:8983/solr #Set this to your solr summary URL. This is the default if you are using our preconfigured Solr instances.


hadoop_user = pgotsr #The user that is running Hadoop


# csv list of hadoop datanodes - not including the machine that punkscan
# is being run from

datanodes = punkscan-slave1,punkscan-slave2,punkscan-slave3 # A comma separated list of all of your Hadoop slaves

Once you're done with that open up punkscan/punk_fuzzer/fuzzer_config/punk_fuzz.cfg.xml and insert your solr details and summary url in the following lines:

      <detail_url><![CDATA[http://<your Solr details ip>:8984/solr]]></detail_url>
      <summary_url><![CDATA[http://<your Solr summary ip>:8983/solr]]></summary_url>
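
Before moving on, it can help to sanity-check the fields in punkscan_config.cfg described above. The following is a minimal Python sketch, not PunkSCAN's own config loader: it assumes the file uses plain "key = value #comment" lines as in the example, and the function names and the list of required keys are taken from that example for illustration only.

```python
REQUIRED_KEYS = [
    "HADOOP_HOME", "NUTCH_HOME", "sim_urls_to_scan", "depth", "topN",
    "solr_details_url", "solr_summary_url", "hadoop_user", "datanodes",
]


def parse_punkscan_config(text):
    """Parse 'key = value #comment' lines into a dict of strings."""
    settings = {}
    for raw_line in text.splitlines():
        # Drop inline and full-line comments, then surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def check_config(settings):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    for key in REQUIRED_KEYS:
        if not settings.get(key):
            problems.append("missing value for %s" % key)
    for key in ("sim_urls_to_scan", "depth"):
        if settings.get(key) and not settings[key].isdigit():
            problems.append("%s should be a whole number" % key)
    top_n = settings.get("topN")
    if top_n and top_n != "all" and not top_n.isdigit():
        problems.append("topN should be 'all' or a whole number")
    return problems
```

To use it, read the file into a string and call check_config(parse_punkscan_config(text)). Note that it only checks the shape of the values, not whether the paths or Solr URLs are actually reachable.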

Load Targets

(1) If you're running Solr on the same machine as PunkSCAN with our preconfigured Solr instances, start the Solr service by doing the following from the root directory of the project:

cd solr

Otherwise, start Solr on your remote servers. If you're using your own Solr instances, ensure that you have copied our summary and detail schema.xml files to them. These can be found in the solr/ directory.

(2) Now it's time to load targets into the system. A target should be a URL with a hostname such as the following: http://www.hyperiongray.com/ or http://punkspider.hyperiongray.com/ but should not contain subfolder indicators like http://www.hyperiongray.com/index.php/about-us/about-us. There's a utility in the root folder called targets.py that will help you get your targets into the system. Targets should be imported from a file in the form <url>;<title> (without the <>) as the input with one entry per line. Here is the usage for that script and an example targets file:

python targets.py targs.csv

Where targs.csv looks like the following*:


http://hyperiongray.com/;Hyperion Gray, LLC
http://punk.hyperiongray.com/;PunkSPIDER Home

*Note: due to a bug in targets.py, blank lines following the targets are not allowed. This is a known issue and will be fixed. Also note that running targets.py WILL overwrite records with the same id (the URL), which means you will lose the vscan_tstamp field recording when a site was scanned. Reloading existing targets is therefore not recommended; in general, you should only load targets once.
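
Since a malformed targets file is easy to produce, here is a minimal Python 3 sketch of a pre-flight check for the <url>;<title> format described above. It is not part of targets.py; the function name is hypothetical, and accepting https alongside http is an assumption, since the examples here only show http.

```python
from urllib.parse import urlparse


def check_targets(lines):
    """Return (line_number, problem) pairs for entries that break the format."""
    problems = []
    for number, line in enumerate(lines, start=1):
        entry = line.rstrip("\n")
        if not entry.strip():
            # Blank lines trip the known targets.py bug.
            problems.append((number, "blank line"))
            continue
        url, separator, title = entry.partition(";")
        if not separator or not title.strip():
            problems.append((number, "expected <url>;<title>"))
            continue
        parsed = urlparse(url.strip())
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            problems.append((number, "URL needs a scheme and hostname"))
        elif parsed.path not in ("", "/"):
            problems.append((number, "URL should not include a subfolder path"))
    return problems


if __name__ == "__main__":
    with open("targs.csv") as handle:
        for number, problem in check_targets(handle):
            print("line %d: %s" % (number, problem))
```

Running it against targs.csv before calling targets.py prints one line per problem and nothing if the file looks fine.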

Here's an example record from Solr Summary immediately after you load it:

<doc>
  <str name="id">http://www.hyperiongray.com/</str>
  <str name="title">Hyperion Gray, LLC</str>
  <date name="tstamp">2012-09-18T15:19:23.874Z</date>
  <str name="url">http://www.hyperiongray.com/</str>
</doc>

Run PunkSCAN

(1) You're finally ready to run punkSCAN! cd to the punkscan/punk_fuzzer directory and you'll see the punk_fuzz.run script. You have the following options for running this:

With the -d flag (distributed single round of scans):

./punk_scan.run -d

The -d flag will run a single batch of scans. In other words it will scan sim_urls_to_scan (from punkscan_config.cfg) urls one time and then stop.


With the -c flag (distributed continuous scans):

./punk_scan.run -c

The -c flag will continuously run punkSCAN until you press Ctrl+C many times in a row to kill it. It will take sim_urls_to_scan number of URLs, scan them, and then move on to the next ones. Once it is done, it will take the ones that were scanned longest ago and scan them again and continue forever.
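
To make the ordering concrete, here is a minimal Python sketch of how a batch could be picked in continuous mode. This is an illustration only, not PunkSCAN's actual selection code: the function name is hypothetical, targets are modeled as plain dicts, and putting never-scanned targets ahead of previously scanned ones is an assumption based on the description above.

```python
def next_batch(targets, batch_size):
    """Pick up to batch_size targets: never-scanned first, then oldest scan first."""
    def scan_age_key(target):
        stamp = target.get("vscan_tstamp")
        # Targets without a vscan_tstamp sort first; the rest sort by
        # timestamp string, which is only approximate when the fractional
        # seconds have different lengths.
        return (stamp is not None, stamp or "")

    return sorted(targets, key=scan_age_key)[:batch_size]


if __name__ == "__main__":
    targets = [
        {"id": "http://a.example/", "vscan_tstamp": "2012-09-11T22:24:37Z"},
        {"id": "http://b.example/"},
        {"id": "http://c.example/", "vscan_tstamp": "2012-08-01T00:00:00Z"},
    ]
    for target in next_batch(targets, 2):
        print(target["id"])
```

Here batch_size plays the role of sim_urls_to_scan. Once every target has a vscan_tstamp, the sketch simply cycles through them from the oldest scan to the newest.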

Getting your results

Results are indexed back to your Solr Summary and Solr Details instances. In Solr Summary a scanned document will look similar to the following:

  <int name="bsqli">0</int>       <!-- This field is the number of blind SQL injection bugs found -->
  <str name="id">http://bibliotecaudd.cl/</str>
  <int name="sqli">0</int>     <!-- This field is the number of SQL injection bugs found -->
  <str name="title">Biblioteca UDD - Recursos de Información y Bibliotecas</str>
  <date name="tstamp">2012-08-10T11:34:59.305Z</date>
  <str name="url">http://bibliotecaudd.cl/</str>
  <date name="vscan_tstamp">2012-09-11T22:24:37.97Z</date>  <!-- This field is the timestamp of approx. when the batch job to scan this site was kicked off -->
  <int name="xss">0</int> <!-- This field is the number of XSS bugs found -->

In other words, summary information.
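
If you want to post-process summary records yourself, the following minimal Python sketch pulls the URL and the three bug counters out of one record. It assumes the record is wrapped in a <doc> element, as in a Solr XML response, and the function name is hypothetical; fetching the record from Solr is left out.

```python
import xml.etree.ElementTree as ET

# Counter fields shown in the Solr Summary example above.
BUG_FIELDS = ("bsqli", "sqli", "xss")


def summarize_doc(doc_xml):
    """Return (url, {field: count}) for a single <doc> element given as text."""
    doc = ET.fromstring(doc_xml)
    url = doc.findtext("str[@name='url']")
    counts = {}
    for field in BUG_FIELDS:
        text = doc.findtext("int[@name='%s']" % field)
        # A missing counter is treated as zero here; that is an assumption.
        counts[field] = int(text) if text is not None else 0
    return url, counts


if __name__ == "__main__":
    sample = (
        '<doc><int name="bsqli">0</int>'
        '<str name="id">http://bibliotecaudd.cl/</str>'
        '<int name="sqli">0</int>'
        '<str name="url">http://bibliotecaudd.cl/</str>'
        '<int name="xss">0</int></doc>'
    )
    print(summarize_doc(sample))
```

A record that has not been scanned yet has no counter fields, so the sketch reports zeros for it; check vscan_tstamp if you need to tell "clean" apart from "not scanned".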

If a bug is found at a URL it outputs a details entry into the Solr Details instance. These entries look something like the following:

  <str name="bugtype">xss</str>
  <str name="id">com.fc2.blog16.esthenews.3</str>
  <str name="info">XSS</str>
  <str name="parameter">q=%3C%2Ftitle%3E%3CScRiPt%3Ealert%28%275dpfthxpju%27%29%3C%2FsCrIpT%3E&s=y&charset=utf-8&range=on&is_adult=true</str>
  <str name="url_main">com.fc2.blog16.esthenews</str>
  <str name="v_url">http://esthenews.blog16.fc2.com/blog-category-24.html?q=</title><ScRiPt>alert('5dpfthxpju')</sCrIpT>&s=y&charset=utf-8&range=on&

In other words, detail information. You should now have a running version of PunkSCAN, and be able to interpret the data coming from it. If you have any questions please feel free to send an email requesting to join the punkSCAN mailing list at punkspider@hyperiongray.com.


Additional Notes

Testing Your PunkSCAN Build

  1. It is recommended that you test your proxy before running PunkSCAN. This can be done with the curl command: curl -x "http://proxy:port" "ifconfig.me". Ensure that you get your proxy IP back.
  2. We recommend starting small with a few URLs, ensuring that everything gets indexed properly, and then moving on to larger scans.

Additional things to check

If you feel the burning urge to tweak the payloads used in fuzzing, look at and edit the punkscan/punk_fuzzer/fuzzer_config/punk_fuzz.cfg.xml file. Please be careful with these settings; we recommend reading up on what each of them does and how we are detecting vulnerabilities before attempting any major changes. Note that most of these payloads get mutated in various ways before getting delivered, so you may be duplicating efforts if you add your own. Details can be found by reading the code in punkscan/punkscan/punk_fuzzer/punk_fuzz.py, or ask us a question through the PunkSCAN mailing list and we can help you out.

Other Notes

At the beginning of a scan, you will see this:


13/01/08 06:28:07 WARN crawl.Crawl: solrUrl is not set, indexing will be skipped...

You will see this warning even if your Solr URL is set in the config. This is a Nutch thing; don't worry about it.


If you get the following messages when starting PunkSCAN:


13/01/21 01:19:02 INFO ipc.Client: Retrying connect to server: localhost/ Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)

13/01/21 01:19:03 INFO ipc.Client: Retrying connect to server: localhost/ Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)

then your Hadoop cluster is not running properly. Make sure that all of your Hadoop services are up before starting a scan.