A Crawler for the Machine-Payable Web

Posted by John Granata

Overview

In this tutorial, we'll show you how to set up a crawler that checks the status of bitcoin-payable endpoints as a scheduled job. We call these "402 endpoints" after HTTP status code 402 (Payment Required), which has long been reserved for future use in the HTTP specification - and which we've now implemented using 21 in the context of bitcoin-payable APIs.

This crawler is a useful tool for periodically checking the status of your own hosted endpoints, or endpoints belonging to others that you want to keep an eye on. It is also a prerequisite for a more sophisticated version of the Intelligent Agents tutorial. Finally, as we will discuss at the end, it is a first step towards how a big chunk of the future World Wide Web might be indexed.

Let's jump in!

Prerequisites

You will need the following items:

  • Install 21

If you've got all the prerequisites, you are ready to go. Let's get started!

Step 1: Create the 402 Crawler Script

Open up a new terminal window. If you use a 21 Bitcoin Computer, ssh into it:

## Only necessary if you use a 21 Bitcoin Computer
ssh twenty@IP_ADDRESS

Create a folder to house the crawler project:

mkdir ~/402-crawler && cd ~/402-crawler

Use a text editor to create a file called crawl.py in the 402-crawler directory, and fill it with the following code:

""" 402 Crawler

    Crawl endpoints, check socket connection, and
    check 402 headers.

"""

import datetime
import logging
import socket

from two1.wallet import Wallet
from two1.bitrequests import BitTransferRequests


class Crawler402():
    """ Crawl endpoints to check status.

        Check server socket connection and query endpoints for
        price and recipient address.

    """
    def __init__(self, endpoint_list, log_file):
        """Set up logging & member vars"""

        # configure logging
        logging.basicConfig(level=logging.INFO,
                            filename=log_file,
                            filemode='a',
                            format='%(asctime)s %(name)-12s %(levelname)-8s %(message)s',
                            datefmt='%m-%d %H:%M')

        self.console = logging.StreamHandler()
        self.console.setLevel(logging.INFO)
        logging.getLogger('402-crawler').addHandler(self.console)
        self.logger = logging.getLogger('402-crawler')
        self.endpoint_list = endpoint_list

    def check_endpoints(self):
        """Crawl 402 endpoints"""

        # create 402 client
        self.bitrequests = BitTransferRequests(Wallet())

        # crawl endpoints, check headers
        self.logger.info("\nCrawling machine-payable endpoints...")
        for endpoint in self.endpoint_list:

            # extract the host name (everything before the first '/')
            name = endpoint.split('/', 1)[0]

            self.logger.info("Checking {}...".format(endpoint))

            # resolve the host and test a TCP connection on the HTTPS port,
            # since the 402 query below uses https://
            try:
                server_ip = socket.gethostbyname(name)
                sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
                sock.settimeout(10)
                server_state = sock.connect_ex((server_ip, 443))
                sock.close()
            except OSError:
                # DNS lookup failed: treat the server as down
                server_state = -1

            if server_state == 0:
                try:
                    self.logger.info("Server state: {} is up!".format(endpoint))
                    response = self.bitrequests.get_402_info('https://'+endpoint)
                    self.logger.info("Price: {}".format(response['price']))
                    self.logger.info("Address: {}".format(response['bitcoin-address']))
                except Exception:
                    self.logger.info("Could not read 402 payment headers.")
            else:
                self.logger.info("Server state: {} is down!".format('https://'+endpoint))
            self.logger.info("Timestamp: {}\n".format(datetime.datetime.now()))


if __name__ == '__main__':

    # 402 endpoints to crawl
    endpoint_list = [
        'mkt.21.co/21dotco/zip_code_data/zipdata/collect',
        'mkt.21.co/21dotco/geocode/geocode/get',
    ]

    crawler = Crawler402(endpoint_list, '402-crawler.log')
    crawler.check_endpoints()

Save the file and close it. Now let's create a scheduled job that executes the crawler code at periodic intervals.
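
The crawler finds the host by splitting the endpoint string and then probes a TCP port before asking for 402 headers. As an aside, here is a minimal sketch of the same two steps using only the Python 3 standard library. The helper names endpoint_host and port_is_open are hypothetical (they are not part of the two1 library), and the default port of 443 is an assumption based on the https:// URLs the crawler queries.

```python
import socket
from urllib.parse import urlsplit


def endpoint_host(endpoint):
    """Return the hostname of a scheme-less endpoint such as 'mkt.21.co/a/b'."""
    # urlsplit only recognizes the host when a scheme is present,
    # so prepend one before parsing.
    return urlsplit('https://' + endpoint).hostname


def port_is_open(host, port=443, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # covers DNS failures, refused connections, and timeouts
        return False
```

Calling endpoint_host('mkt.21.co/21dotco/geocode/geocode/get') returns 'mkt.21.co', which can then be passed to port_is_open. The timeout keeps a scheduled run from hanging on an unresponsive server.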

Step 2: Create a Systemd Service

Create your user's systemd config directory if it does not exist yet, and change to it:

mkdir -p ~/.config/systemd/user
cd ~/.config/systemd/user

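Create a file called 402-crawler.service and open it in your text editor. Enter the following code. The paths assume the twenty user on a 21 Bitcoin Computer, so adjust them if your home directory is different: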

[Unit]
Description=Crawl 402 endpoints

[Service]
Type=oneshot
WorkingDirectory=/home/twenty/402-crawler
ExecStart=/usr/bin/python3 /home/twenty/402-crawler/crawl.py

Save and close the file. Now let's create a systemd timer to periodically execute this service.

Step 3: Create a Systemd Timer

Create a file called 402-crawler.timer in the same directory and open it in your text editor. Enter the following code:

[Unit]
Description=Periodically trigger the 402 crawler

[Timer]
OnBootSec=1min
OnUnitActiveSec=20m

[Install]
WantedBy=timers.target

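This timer first runs the crawl.py script one minute after boot, and then again every 20 minutes after the service was last activated. Because the timer and the service share the name 402-crawler, systemd pairs them automatically. Save and close the file.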

Step 4: Enable the Crawler Service

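All that's left is to enable the service. Execute the following commands in a terminal. The first one makes systemd pick up the new unit files, the next two enable and start the timer, and the last one runs the crawler once right away:

systemctl --user daemon-reload
systemctl --user enable 402-crawler.timer
systemctl --user start 402-crawler.timer
systemctl --user start 402-crawler.service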

That's it! You now have a crawler service enabled that will check the status of endpoints at a desired time interval. You can disable the timer at any time with:

systemctl --user stop 402-crawler.timer
systemctl --user disable 402-crawler.timer
systemctl --user stop 402-crawler.service
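
Once the timer has run a few times, 402-crawler.log will contain one "Server state" line per endpoint per run. Here is a minimal sketch of how you might reduce that log to the most recent state of each endpoint. The function name latest_states is hypothetical, and the pattern assumes the exact message strings logged by crawl.py above.

```python
import re

# Matches the "Server state: <endpoint> is up!" / "... is down!" messages
STATE_RE = re.compile(r"Server state: (?P<endpoint>\S+) is (?P<state>up|down)!")


def latest_states(log_lines):
    """Map each endpoint to the last state ('up' or 'down') seen in the log."""
    states = {}
    for line in log_lines:
        match = STATE_RE.search(line)
        if match is None:
            continue
        endpoint = match.group("endpoint")
        # the crawler logs 'down' endpoints with an 'https://' prefix; strip it
        if endpoint.startswith("https://"):
            endpoint = endpoint[len("https://"):]
        states[endpoint] = match.group("state")
    return states
```

Because an open file yields its lines when you iterate over it, you can pass one directly, for example latest_states(open('402-crawler.log')).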

Next Steps

Though they are sometimes used interchangeably in conversation, the Internet is not the same as the World Wide Web (WWW). All kinds of protocols run over the Internet - SMTP, FTP, SCP, and of course HTTP and HTTPS. The World Wide Web roughly corresponds to the set of resources coded in HTML and accessible via URLs over HTTP and HTTPS (and newer protocols like SPDY).

The implementation of HTTP status code 402 means that we can now bill for HTTP-accessible resources in bitcoin. Analogs of 402 will need to be developed for other protocols that don't precisely fit the request/response paradigm of HTTP.

But even just in the context of the WWW, a working implementation of 402-aware servers and clients means that we have the basic tools to build a third wing of the Web. First there was just the World Wide Web, indexed by Google. Then there was the Social Web, indexed and hosted by Facebook, Twitter, and company. And now there will be the Machine-Payable Web.

While the Social Web is gated behind logins and therefore opaque to Google, the early Machine-Payable Web poses a different challenge. A crawler like the one in this example can list the prices, but to actually access the URLs will require a small amount of bitcoin per request.

This has the long-term potential to change the balance of power between search engines (especially Google) and content providers. Right now, Google's search engine crawls and indexes news websites like the Wall Street Journal, which displeases many newspaper editors. For many years, the Wall Street Journal has wanted Google to compensate it for each page request. Google counters that any news organization that wants to be dropped from its index need only modify its robots.txt file to block the Google crawler. Needless to say, few news organizations want to lose the traffic that Google drives their way, so few have chosen to delist themselves.

However, the introduction of 402-payable URLs and 402-aware web crawlers may change that. In the not-too-distant future, content providers - starting with individual blogs and then scaling up to larger websites - can set up 402 paywalls that charge each visitor a tiny amount of bitcoin. This would be a negligible toll for the casual consumer of news content. Industrial-scale scraping at the level of Googlebot, however, could become very expensive indeed: Google would have to procure bitcoin as a kind of digital commodity to power its search engine, much like Apple must scour the planet for rare earth elements to build its iPhones.

If you're interested in this kind of thing, here are a few possible next steps:

  • Modify an open source blogging platform like WordPress to create a bitcoin-payable blog, and host it using 21.
  • Take the example of selling files for bitcoin and modify it to sell PDFs for bitcoin, thereby allowing the monetization of news archives.
  • Set up N machine-payable endpoints with time-varying prices, and feed the output of this crawler to a bitcoin-payable intelligent agent that decides which endpoint to buy from.
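
As a starting point for the last idea, here is a minimal sketch of the decision step. It assumes you have already collected the crawled prices into a dictionary that maps each endpoint to its price in satoshis, with None for endpoints that were down. That data structure and the function name cheapest_endpoint are hypothetical and not part of the crawler above.

```python
def cheapest_endpoint(prices, budget):
    """Return the lowest-priced reachable endpoint within budget, or None."""
    affordable = {
        endpoint: price
        for endpoint, price in prices.items()
        # skip endpoints that were down (None) or cost more than the budget
        if price is not None and price <= budget
    }
    if not affordable:
        return None
    return min(affordable, key=affordable.get)
```

An agent could call this after every crawl and only make a paid request when the function returns an endpoint.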

As always, if you build anything like this, please come to the 21 Developer Community at slack.21.co to buy from and sell to other developers!


How to send your Bitcoin to the Blockchain

Just as a reminder, you can send bitcoin mined or earned in your 21.co balance to the blockchain at any time by running 21 flush. A transaction will be created within 10 minutes, and you can view the transaction id with 21 log. Once the transaction has been confirmed, you can check the balance in your bitcoin wallet from the command line with wallet balance, and you can send bitcoin from your wallet to another address with wallet sendto $BITCOIN_ADDRESS --satoshis $SATOSHI_AMOUNT --use-unconfirmed. The --satoshis flag allows you to specify the amount in satoshis; without it the sendto amount is in BTC, but this behavior is deprecated and will be removed soon. The --use-unconfirmed flag ensures that you can send even if you have unconfirmed transactions in your wallet.
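
Since the --satoshis flag expects a whole number of satoshis, it can help to convert from BTC first. One bitcoin is 100,000,000 satoshis. Here is a minimal sketch of that conversion; the helper name btc_to_satoshis is hypothetical and not part of the 21 tools. It uses Decimal rather than float to avoid rounding errors.

```python
from decimal import Decimal

SATOSHIS_PER_BTC = 100000000  # 1 BTC = 100,000,000 satoshis


def btc_to_satoshis(amount_btc):
    """Convert a BTC amount given as a string, e.g. '0.0005', to whole satoshis."""
    satoshis = Decimal(amount_btc) * SATOSHIS_PER_BTC
    # reject amounts that are not a whole number of satoshis
    if satoshis != satoshis.to_integral_value():
        raise ValueError("amount has more than 8 decimal places")
    return int(satoshis)
```

For example, btc_to_satoshis('0.0005') gives 50000, which is the value you would pass as $SATOSHI_AMOUNT.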


Ready to sell your endpoint? Go to slack.21.co

Ready to try out your bitcoin-payable server in the wild? Or simply want to browse and purchase from other bitcoin-enabled servers? Head over to the 21 Developer Community at slack.21.co to join the bitcoin machine-payable marketplace hosted on the 21 peer-to-peer network.