How to scrape an old PHP (or whatever) site with wget for use in Nginx

Wednesday, September 9, 2015

If you’re like me, in your youth you once made websites with PHP that have uncool URLs like /index.php?seemed-like-a-good-idea=at-the-time. Well, time has passed and now you want to stop using Apache, MySQL, and PHP on your LAMP server, but you also don’t want to just drop your old website entirely off the face of the internet. How can you migrate your old pages to Nginx?

The simple solution is to use wget. It’s easy to install on pretty much any platform. (On OS X, try installing it with Homebrew.) But there are a few subtleties to using it. You want to keep your ugly old URLs with ? in them working, even though you don’t want them to be dynamically created from a database any more. You also want to make sure Nginx serves your pages with the proper MIME type of text/html, because if the MIME type is set incorrectly, browsers will end up downloading your pages instead of displaying them.

Here’s what to do.

First, use FTP or whatever to copy the existing site onto your local machine. (These are old sites, right? So you don’t have version control, do you? 😓) This step is to ensure you have all the images, CSS files, and other assets that were no doubt haphazardly scattered throughout your old site.

Next, go through and delete all the *.php files and any hidden .whatever files, so Nginx doesn’t end up accidentally serving up a file that contains your old Yahoo mail password from ten years ago or something because it seemed like a good idea to save your password in plaintext at the time.
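
If you would rather not hunt for those files by hand, the cleanup can be sketched as a small shell function. This is only a sketch: clean_site_copy is a made-up helper name, it removes files only (hidden directories such as .svn are left for you to deal with), and it prints each path as it deletes it so you can see what went away. Run it on the local copy, never on the live server.

```shell
# Remove PHP sources and hidden dotfiles from a local copy of the site.
# clean_site_copy is a hypothetical helper name, not part of any tool.
# Usage: clean_site_copy /path/to/local/copy
clean_site_copy() {
    site_dir="$1"
    # Refuse to run unless the argument is an existing directory.
    [ -d "$site_dir" ] || { echo "not a directory: $site_dir" >&2; return 1; }
    # Print, then delete, every *.php file and every dotfile (.htaccess etc.).
    find "$site_dir" -type f \( -name '*.php' -o -name '.*' \) \
        -print -exec rm -f {} +
}
```

For example, clean_site_copy ./example would clean a copy stored in ./example and leave images, CSS, and other assets alone.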

Now, cd into the directory that holds your local copy of the files from the server, and use this wget command to add scraped copies of your dynamic pages:

    wget \
         --recursive \
         --no-clobber \
         --page-requisites \
         --html-extension \
         --domains example.com \
         --no-parent \
             www.example.com/foobar/index.php/

Here’s what the flags mean:

--recursive is so you scrape all the pages that can be reached from the page you start with.
--no-clobber means you won’t replace the files you just fetched off the server.
--page-requisites is somewhat redundant, but it will fetch any asset files you may have missed in your copy from the server.
--html-extension is a bit of a wrinkle: it saves all the HTML pages it fetches with a .html extension. This is so that Nginx will know to serve your pages with the correct MIME type. (Newer versions of wget call this flag --adjust-extension, but the old name should still work.)
--domains example.com and --no-parent are so you scrape only the portion of the site that you want. In this case, the root of example.com would be left alone. Your case may be different.
The final argument is the address of the page to start fetching from.

wget will save these pages with two wrinkles that you’ll need to tell Nginx about. First, as mentioned, Nginx needs to know to ignore the .html on the end of the file names. Second, you’ll need to be able to serve up URLs with ? in the file name. To do both of those things, add this directive to the server block for your site in Nginx: try_files $uri $uri/index.html $request_uri.html =404;

try_files tells Nginx to try multiple files, in the order specified, when serving a URL. $uri is the plain URL path without the query string (e.g. for your CSS/JS/image assets). $uri/index.html serves up index pages, which wget will create whenever a URL ends in a slash. $request_uri.html serves up files that include ? in the middle and end in the .html that wget appended. If none of those exist, =404 returns a 404 error.
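
To check which file a given old URL will end up hitting, the lookup order can be imitated in the shell. This is a rough sketch and not how Nginx is implemented: resolve_try_files is a made-up helper name, and it ignores the URL decoding and path normalization that Nginx applies to $uri.

```shell
# Print the file that the try_files line above would pick for a request.
# resolve_try_files is a hypothetical helper, only an approximation of Nginx.
# $1 = document root, $2 = request URI exactly as the browser sends it.
resolve_try_files() {
    root="$1"
    request_uri="$2"
    # $uri is the request path with everything from the first ? removed.
    uri="${request_uri%%\?*}"
    for candidate in "$uri" "$uri/index.html" "$request_uri.html"; do
        # Like try_files, accept only regular files, not directories.
        if [ -f "$root$candidate" ]; then
            echo "$root$candidate"
            return 0
        fi
    done
    # Nothing matched, which corresponds to the final =404.
    echo "404"
    return 1
}
```

Assuming wget saved a page as foobar/index.php?page=2.html, calling resolve_try_files /path/to/sites/example '/foobar/index.php?page=2' should print that file’s path, because the first two candidates don’t exist once the *.php files are gone.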

Here’s a minimally complete Nginx configuration example:

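    # Nginx refuses to start without an events block, even an empty one.
    events {}

    http {

        server {
            # Answer for both the bare domain and the www host.
            server_name example.com www.example.com;

            listen 80;

            root /path/to/sites/example;
            error_page 404 403 /404.html;
            # Try assets, then index pages, then wget's ?-and-.html files.
            try_files $uri $uri/index.html $request_uri.html =404;
        }
    }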

See the Nginx HTTP server boilerplate configs project for a complete example. (Note that this example assumes you have a 404.html to serve up for missing pages.)
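
Once the new server is up, it’s worth confirming that the old URLs really come back as text/html rather than as a download. Here is one way to sketch that check: is_html_response is a made-up helper that reads response headers on stdin, and the URL in the usage comment is a placeholder for one of your own pages.

```shell
# Read HTTP response headers on stdin and succeed only if the
# Content-Type header says text/html. Hypothetical helper name.
is_html_response() {
    # Strip carriage returns, find the header, then look for text/html.
    tr -d '\r' | grep -i '^content-type:' | grep -qi 'text/html'
}

# Usage (placeholder URL; needs curl and a running server):
#   curl -sI 'http://www.example.com/foobar/index.php?page=2' | is_html_response \
#       && echo "served as HTML" || echo "wrong MIME type"
```

If this reports the wrong MIME type, the usual suspect is a page that got saved without the .html extension.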

The Ethically-Trained Programmer