Offline Rendering of WordPress Blog Posts?
I’d like to take a few tens of thousands of WordPress blog posts and turn each into a standalone HTML page. I have the text that WordPress stores in its database, but there’s a problem: WordPress doesn’t (usually) store <p>…</p> paragraph tags. Instead, it (usually) interprets a blank line as a between-paragraph marker. I say “usually” because inside a <pre>…</pre> block, WordPress leaves blank lines alone. Oh, and it does funky things with tables, and… You get the picture. So what I want is a command-line tool suitable for batch processing that’ll take the text stored in the database and produce exactly the HTML that WordPress actually hands off to browsers. Problem is, I don’t speak PHP, and don’t have a couple of hours to browse the WP code base. If someone already has what I’m looking for, I’d be grateful for a pointer…
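To make the blank-line rule concrete, here is a minimal Python sketch of the behaviour described above: blank lines separate paragraphs, except inside <pre>…</pre>. It is a deliberately simplified stand-in and not a port of WordPress’s own filter (wpautop(), as far as I know), so it will not reproduce the table handling or other special cases. The function name simple_autop is made up for illustration.

```python
import re

# Matches a whole <pre>...</pre> block, including any newlines inside it.
PRE_BLOCK = re.compile(r"(<pre\b.*?</pre>)", re.IGNORECASE | re.DOTALL)


def simple_autop(text):
    """Wrap blank-line-separated chunks in <p> tags, leaving <pre> blocks alone.

    Simplified illustration only; WordPress's real filter handles many more cases.
    """
    text = text.replace("\r\n", "\n")
    # Splitting on a pattern with one capture group puts the <pre> blocks
    # at the odd indices and the surrounding text at the even indices.
    pieces = PRE_BLOCK.split(text)
    out = []
    for i, piece in enumerate(pieces):
        if i % 2 == 1:
            out.append(piece)  # <pre> block: keep verbatim, blank lines and all
            continue
        for chunk in re.split(r"\n\s*\n", piece):
            chunk = chunk.strip()
            if chunk:
                out.append("<p>" + chunk + "</p>")
    return "\n".join(out)
```

Feeding it "one\n\ntwo" gives two <p> elements, while a blank line inside a <pre> block passes through unchanged.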
Would scraping the site be an option? You’d be able to get by without speaking PHP and you wouldn’t need to figure out how WP is rendering its tables.
I admit that this is not an answer to your question as written, but would your problem be solved with WP-SuperCache? The plugin’s stated job is to provide heavy duty caching so that WP doesn’t fall over during peak times, but one way it accomplishes that is to serve static HTML content when under load. Surely it can be prodded into doing that across all posts on the site?
The “exactly” part is what might be the problem.
If you are willing to be a bit more flexible you could export your WP to XML (provided by WP in settings) and then use whatever language you like to parse it and output in the format you want (not only HTML).
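As a rough sketch of the parsing step, assuming the export is the usual RSS-style file with the post body in a content:encoded element, something like the following Python pulls out each title and raw body. The helper name iter_posts is hypothetical, and the bodies it yields are still the unrendered database text (no <p> tags). An export can also contain pages and attachments as items, which this sketch does not filter out.

```python
import xml.etree.ElementTree as ET

# Namespace of the <content:encoded> element that holds the post body.
CONTENT_NS = "{http://purl.org/rss/1.0/modules/content/}"


def iter_posts(wxr_xml):
    """Yield (title, raw_body) for every <item> in a WordPress export string."""
    root = ET.fromstring(wxr_xml)
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        body = item.findtext(CONTENT_NS + "encoded", default="")
        yield title, body
```

For a file on disk, ET.parse(path).getroot() can replace the ET.fromstring call.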
When my WP got hacked (fully updated) I settled for posting to g+ but still wanted to allow people to find the old content. To prevent re-hacking I downloaded my own content and parsed it to static HTML pages. Go hack that!
The (rather) lame code is on GitHub:
https://github.com/zarate/wpbackup-2-html
Getting the text out is no big deal; the problem is correctly getting everything that goes with it: CSS, images, JS, attachments… If you want it to look exactly like it does on your regular WP blog there’s a bit of work to do.
So maybe scraping is not a bad idea! :)
Many of the caching plugins for WP do exactly that: they render a static HTML file of each page, to be served to non-logged-in users, and store it in a directory. When a request comes through, an .htaccess rule checks whether there’s a static version of the page available and, if there is, serves it without even invoking PHP.
WP Super Cache comes to mind.
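The check that such a rewrite rule performs can be sketched in a few lines of Python. The cache_root/host/path/index.html layout and the function name static_file_for are assumptions for illustration, not the documented layout of any particular plugin.

```python
import os


def static_file_for(cache_root, host, request_path):
    """Return the cached HTML file for a request, or None if there isn't one.

    Assumes (hypothetically) that pages are cached as
    cache_root/host/<request path>/index.html.
    """
    # Drop empty and dot segments so the lookup cannot escape cache_root.
    parts = [p for p in request_path.split("/") if p not in ("", ".", "..")]
    candidate = os.path.join(cache_root, host, *parts, "index.html")
    return candidate if os.path.isfile(candidate) else None
```

If the function returns a path, the web server can send that file directly; if it returns None, the request falls through to PHP as usual.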
You need to refine your question, as your base assumption is wrong: WordPress (core) does not produce HTML; it is the theme, and maybe plugins, that generate it.
There is also the question of how you will handle static resources, which might be trivial (if all image locations are relative) or not.
I recently started looking at Pelican, a static blog platform in Python. Here’s a blog post that talks about extracting data from WordPress and converting it to static HTML: http://www.macdrifter.com/migrating-to-pelican-extracting-wordpress-data.html
@mark k.
If it’s only the post HTML that he’s after, WordPress *does* produce the HTML, via the the_content() function.
I threw together a plugin to do this:
https://gist.github.com/3899726
It grabs all of your published posts and writes each one to an HTML file in a directory you specify. Check the header of the file for instructions on using it, and if your blog has more than 50,000 posts, you’ll need to increase the numberposts parameter on line 16.
It finished processing the 500 posts from my blog in under a second; here’s an example of what it generates: http://www.chrisfinke.com/html-blog/
(I realize that it’s not a command line tool and it doesn’t work for “offline processing,” if the blog you’re looking to export is no longer online, but trying to replicate how WordPress produces HTML without an active WordPress blog is probably more trouble than it’s worth.)
@Chris Finke: turning it into a CLI tool should be pretty easy:
* include wp-load.php at the top
* use $argv instead of $_GET
@chris, yes it works if you don’t mind unformatted content, losing post meta data (category, tags and publish date), and having internal links stop working