How to convert HTML content to WordPress content

About 20% of all websites run on WordPress, with more making the switch every day. Although I’ve seen many articles on how to migrate a site’s custom design into a WordPress template, virtually none address how to convert HTML content to WordPress content.

The reason is that raw HTML sites are each built individually, without the predictable structure a CMS provides, so no generic tool can reliably scrape the right content for WordPress. A few articles suggest entering the pages manually with copy/paste or hiring an engineer to write a custom conversion. Both are daunting if you have a huge website to convert.

However, with a little PHP knowledge, converting a well-structured HTML site can be relatively easy using the DOMDocument class. It parses an HTML page into a tree of nodes following the document object model. We’ll use that to extract the info we need to create WordPress posts.

Here is our example of a raw HTML site with three levels of depth. There are a total of 6 product pages we want to make WordPress posts for. This command-line PHP script will recursively parse the site and create a new WordPress post for each product.

I’ve kept the code as simple as possible, with default args for our example site. The -t and -e args use XPath syntax, so you should be somewhat familiar with it. I put in plenty of comments and even run-time debug info so you can see how it works. You might have to customize the script if the target site has a more complicated structure.
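If XPath is new to you, it can help to try the two default expressions on a small page first. The snippet below is only a sketch: the sample markup is made up for illustration and is not the real example site. It loads the HTML from a string with DOMDocument, then runs the default -t and -e expressions through DOMXPath.

```php
<?php
// Hypothetical sample markup standing in for one page of a site.
$sample = <<<HTML
<html><body>
  <div class="menu"><a href="about.html">About</a></div>
  <div class="products">
    <a href="product1.html">The Widgetanator</a>
    <a href="product2.html"><img src="p2.png" alt=""></a>
  </div>
  <h1>Our Products</h1>
  <p>Pick a product to read more.</p>
</body></html>
HTML;

libxml_use_internal_errors(true); // collect parse warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($sample);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Default -t expression: anchors inside <div class="products"> only.
$links = [];
foreach ($xpath->query("//div[@class='products']//a") as $a) {
    $links[] = [$a->getAttribute('href'), trim($a->nodeValue)];
}

// Default -e expression: every <h1>, <h2> and <p> on the page.
$elements = [];
foreach ($xpath->query("//h1|//h2|//p") as $e) {
    $elements[] = $e->nodeName . ': ' . trim($e->nodeValue);
}

print_r($links);
print_r($elements);
```

The menu link is ignored because it sits outside the products div. The second product link wraps an image, so its anchor text is empty, which is the kind of anchor the script skips when it picks post titles.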

<?php
// Name: convert_html_to_wordpress.php
// Version 1.0 Released May 11th, 2014
// imports target HTML page via DOMDocument, parses sub-pages for content and saves them as wordpress posts
// Author: https://www.superblogme.com/

require_once("./wp-load.php"); // from wordpress installation

// command line options:
// -d=debug options: true = messages only, don't create WP posts. false = create WP posts.
// -u=URL to parse for the anchor links to use creating WP posts.
// -t=target_sections: only check for anchor links in these target object(s). Refer to http://www.w3schools.com/Xpath/
// -e=page elements: for each anchor link, grab text in these elements for the post_content
// -l=limit: max amount of pages to process. default is all

$shortopts = "d::u:t::e::l::";
$options = getopt($shortopts);
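// getopt() returns strings, and the string "false" is truthy in PHP, so convert it explicitly
$debug = (isset($options['d']) ? filter_var($options['d'], FILTER_VALIDATE_BOOLEAN) : true);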
$startURL = (isset($options['u']) ? $options['u'] : "https://superblogme.com/examples/main_site.html");
$target_sections = (isset($options['t']) ? $options['t'] : "//div[@class='products']//a");
$page_elements = (isset($options['e']) ? $options['e'] : "//h1|//h2|//p");
$limit = (isset($options['l']) ? $options['l'] : 0);
if (empty($startURL)) exit("\nUrl Required!\nUsage: $argv[0] -d=[debug true/false] -u=[url to parse] -t=[sections to target] -e=[page elements to save] -l=[limit]\n");
$parsed_urls = array(); // so we don't recheck urls and avoid an infinite loop
$cnt=0;

//////////////////////////////////////////////////////////////////////////////////////////////////////
function convert_html_to_wordpress($URL,$savepost=0,$post_title="") {
global $debug, $target_sections, $page_elements,$parsed_urls,$limit,$cnt;

 if ($debug) echo "\n===== convert_html_to_wordpress($URL,$savepost,$post_title)\n";
 $html = new DOMDocument();
 @$html->loadHTMLFile($URL); // load the target url as a DOM object

 if (!$html->documentURI) {
   if ($debug) echo "loadHTMLFile URI is null. Returning.\n";
   return;
 }

 if ($savepost) { // grab the page_elements from this URL to make a WP Post
 if ($debug) echo "\tGetting post_content from " . $URL . " from elements " . $page_elements . "\n";
 $post_content = ""; // clear any previous data
 $xpath = new DOMXPath($html);
 $nodes = $xpath->query($page_elements); 
 foreach($nodes as $e) { // save element as well as content
   $post_content .= "<$e->nodeName>" . $e->nodeValue . "</$e->nodeName>";
 }
 
 if ($debug) echo "Saving Post: " . $post_title . "\n"; 
 // We have our data, let's create a WordPress post with it. 
 $new_WP_post = array(
   'post_title' => $post_title, // The title of the post.
   'post_content' => $post_content, // The full text of the post.
   //'post_status' => 'publish', // Default is 'draft'.
 // if you want to insert the posts as published instead, uncomment above line
 ); 
 
 if ($debug) print_r($new_WP_post);
 // Insert the post into the database
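 if (!$debug) {
   $result = wp_insert_post( $new_WP_post, true ); // true = return a WP_Error object on failure
   if (is_wp_error($result)) echo "Failed to save post: " . $result->get_error_message() . "\n";
 }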
 array_push($parsed_urls,$URL); // to prevent parsing it again from another page
 
 if ($limit && ++$cnt >= $limit) exit ("Exiting... Reached our limit.\n");
 
 } // end if savepost

 if ($debug) echo "Searching " . $URL . " for: \"" . $target_sections . "\" section(s)\n";
 $xpath = new DOMXPath($html);
 $nodes = $xpath->query($target_sections);

 foreach($nodes as $link) { // grab the anchors from the target section(s)
   if ($link->hasAttribute('href')) {
     $thelink = $link->getAttribute('href');
   }
   else {
     continue;
   }

   if ($debug) echo "Found anchor link " . $thelink . "\n";
   $breakdownURL = parse_url($URL);
   $pathname = rtrim($breakdownURL['scheme'] . "://" . $breakdownURL['host'],"/");
   if (!preg_match('#^https?://#i', $thelink)) { // no http:// or https:// prefix, so the anchor is a relative url
     if ($thelink[0] === '/') { // absolute path
       $thelink = $pathname . $thelink;
     }
     else { // relative path
       $thelink = $pathname . "/" . dirname($breakdownURL['path']) . "/" . $thelink;
     }
   }
   else if (strpos($thelink,$pathname) === FALSE) {// don't process if external url
     continue;
   }

   if (in_array($thelink,$parsed_urls)) { // already parsed this anchor so skip
     continue;
   }

   $post_title = $link->nodeValue; // let's use the anchor text as the default post title in WordPress
   if ($post_title) { // should skip anchors with no text (such as wrapped around images)
     convert_html_to_wordpress($thelink,1,$post_title);// now check this anchor page for any target_sections too
   }
 } // end foreach nodes

 if ($debug) echo "Finished reading " . $URL . "\n";
} // end function convert_html_to_wordpress
//////////////////////////////////////////////////////////////////////////////////////////////////////

convert_html_to_wordpress($startURL);
echo "\nDone!\n\n";
?>



Usage: convert_html_to_wordpress.php -d=[debug true/false] -u=[url to parse] -t=[sections to target] -e=[page elements to save] -l=[limit]

Example output of script to convert HTML content to WordPress content

How does it work? On each page, the script looks in the sections we define for anchor links. In our example, we tell it to look in the <div class="products"> section. For each anchor link in that section, it loads the linked page, grabs the content of the <h1>, <h2> and <p> tags, and uses that data for a new WordPress post. This is the output from the first product page found:

 

-bash-3.2$ php convert_html_to_wordpress.php -d=1 -u=https://superblogme.com/examples/main_site.html -t="//div[@class='products']//a" -e="//h1|//h2|//p"

===== convert_html_to_wordpress(https://superblogme.com/examples/main_site.html,0,)
Searching https://superblogme.com/examples/main_site.html for: "//div[@class='products']//a" section(s)
Found anchor link product1.html

===== convert_html_to_wordpress(https://superblogme.com//examples/product1.html,1,The Widgetanator)
 Getting post_content from https://superblogme.com//examples/product1.html from elements //h1|//h2|//p
Saving Post: The Widgetanator
Array
(
 [post_title] => The Widgetanator
 [post_content] => <h1> The Widgetanator </h1><h2> A description about the Widgetanator </h2><p>Lorem ipsum dolor sit amet, con
sectetur adipiscing elit. Aenean auctor arcu vel neque fringilla commodo sed eu quam. Quisque congue tincidunt sagittis. In hac ha
bitasse platea dictumst. Proin ullamcorper risus eget erat accumsan tincidunt. Vivamus at metus sit amet ligula commodo fermentum 
non non turpis. Suspendisse hendrerit luctus lectus at laoreet. Fusce nisi ligula, ullamcorper quis mollis quis, tincidunt et lacu
s. Sed tincidunt erat at nunc tristique, vehicula auctor leo venenatis. Vivamus varius elit id purus lobortis accumsan. Nullam fac
ilisis nunc id ipsum gravida pulvinar. Quisque sed eros fringilla, consectetur sem at, euismod ante. Suspendisse ut vehicula mi, n
on semper libero.</p><p>Nulla ligula purus, convallis eget est vel, semper volutpat eros. Maecenas laoreet turpis ultricies alique
t sagittis. Curabitur varius felis diam, sed sagittis mauris gravida non. Curabitur ut varius ligula. Aenean elementum tortor turp
is, at convallis metus scelerisque ut. Phasellus lacinia mauris eu est imperdiet, vitae viverra tellus mattis. Quisque quis urna m
auris. Nunc in nisi ac enim gravida condimentum. Aliquam viverra orci dui, ut laoreet libero sollicitudin ac. Cras eu volutpat ris
us.</p>
)
Searching https://superblogme.com//examples/product1.html for: "//div[@class='products']//a" section(s)
Finished reading https://superblogme.com//examples/product1.html
.
.
.
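The doubled slash in the product URLs above comes from the script joining the host, the directory and the link with a fixed "/" between each part. Most web servers tolerate it, but if yours does not, a small resolver can build cleaner URLs. This is only a sketch: resolve_link() is a hypothetical helper that is not part of the script, and it assumes plain links with no port number, no query string and no "../" segments.

```php
<?php
// Hypothetical helper, not part of the script above: resolve an anchor's href
// against the URL of the page it was found on.
function resolve_link(string $pageURL, string $href): string
{
    // Already absolute (http or https): use it as-is.
    if (preg_match('#^https?://#i', $href)) {
        return $href;
    }

    $parts = parse_url($pageURL);
    $root  = $parts['scheme'] . '://' . $parts['host'];

    // Root-relative link such as "/examples/product1.html".
    if ($href !== '' && $href[0] === '/') {
        return $root . $href;
    }

    // Document-relative link: join it to the directory of the current page,
    // which is everything in the path up to (not including) the last "/".
    $path = $parts['path'] ?? '/';
    $dir  = substr($path, 0, (int) strrpos($path, '/'));
    return $root . $dir . '/' . $href;
}

echo resolve_link('https://superblogme.com/examples/main_site.html', 'product1.html'), "\n";
```

Because the resolver also recognizes https:// links as absolute, it covers sites served over either scheme.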

Now change the arguments for the site you want to convert. Make sure the -t and -e values use XPath syntax.

Let’s try grabbing the first entry of the ‘In the news’ section on Wikipedia using their specific HTML tags.

-bash-3.2$ php convert_html_to_wordpress.php -d=1 -u=http://en.wikipedia.org/wiki/Main_Page -t="//div[@id='mp-itn']//a" -e="//h1|//div[@id='mw-content-text']//p" -l=1

Run it with debug=true until the script extracts exactly what you want for your WordPress posts.
Run it with debug=false to save the posts into WordPress.
Open the WordPress dashboard and verify the posts are there!
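One thing to watch when switching debug off: command-line values arrive as strings, and in PHP the string "false" counts as true in a plain if test. A minimal sketch of converting such a value explicitly, using a hypothetical helper parse_flag() that is not part of the script:

```php
<?php
// Hypothetical helper, not part of the script above: turn a command-line
// value such as "true", "false", "1" or "0" into a real boolean.
function parse_flag($value, bool $default = true): bool
{
    if ($value === null || $value === false || $value === '') {
        return $default; // option missing or given without a value
    }
    // With FILTER_NULL_ON_FAILURE, unrecognized input yields null
    // instead of false, so we can fall back to the default.
    $parsed = filter_var($value, FILTER_VALIDATE_BOOLEAN, FILTER_NULL_ON_FAILURE);
    return $parsed ?? $default;
}

var_dump((bool) "false");      // bool(true): a plain cast treats "false" as true
var_dump(parse_flag("false")); // bool(false): explicit parsing
```

Falling back to debug mode on unrecognized input is a deliberate choice here: a typo in the flag then prints messages instead of writing posts into WordPress.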

 

John Holt

I’m a software engineer and web designer in Fort Pierce, Florida (winter) and Franklin, New Hampshire (summer). Super Blog Me is a space where I can brainstorm my ideas on WordPress development, interface design and other things I do with my life.
