
HTML to PDF

This work is licensed under the Creative Commons Attribution Non-Commercial 2.0 UK: England & Wales Licence. This means that you are free to: copy; distribute; and modify this work. It also means that you cannot use it for commercial purposes. Additionally, you must attribute this work to the original author, Thomas Guymer, ideally with a link.

This is just a brief walk-through. I am not going to show you exactly how I do it, as it took me ages to fine-tune and I'm not ready to release it all yet. Consider this a 'proof of concept'. It will show you a few stages in the process I use to convert HTML to PDF.

First, consider this HTML source code:

<h1>Document Title</h1>
<p class="attention"><img src="/graphics/creativeCommons.png" alt="Creative Commons Logo" width="256" height="60" />This work is licensed under the <a href="http://creativecommons.org/licenses/by-nc/2.0/uk/" title="Creative Commons Attribution Non-Commercial 2.0 UK: England &amp; Wales Licence" class="external">Creative Commons Attribution Non-Commercial 2.0 UK: England &amp; Wales Licence</a>. This licence is provided by <a href="http://www.creativecommons.org/" title="Creative Commons" class="external">Creative Commons</a>. This means that <b>you are free</b> to: copy; distribute; and modify this work. It also means that <b>you cannot</b> use it for commercial purposes. Additionally, you must attribute this work to the original author, <em><a href="/who/" title="Who is Thomas Guymer?">Thomas Guymer</a></em>. As this licence requires attribution <b>it is not compatible with any of the <a href="http://www.gnu.org/licenses/licenses.html" title="GNU Licenses" class="external"><abbr title="GNU&apos;s Not Unix">GNU</abbr> Licenses</a></b>.</p>
<h2>Chapter Title</h2>
<p>Lorem <em>ipsum</em> dolor sit amet, consectetur adipisicing elit, sed do <q>eiusmod</q> tempor incididunt ut labore et dolore magna <b>aliqua</b>. Ut enim ad minim veniam, quis nostrud <i>exercitation</i> ullamco laboris nisi ut aliquip ex ea commodo <a href="http://www.google.co.uk" title="Google Search Engine">consequat</a>. Duis aute irure dolor in reprehenderit in <code>voluptate</code> velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat <small>cupidatat</small> non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p>
<h3>Section Title</h3>
<p>Lorem <em>ipsum</em> dolor sit amet, consectetur adipisicing elit, sed do <q>eiusmod</q> tempor incididunt ut labore et dolore magna <b>aliqua</b>. Ut enim ad minim veniam, quis nostrud <i>exercitation</i> ullamco laboris nisi ut aliquip ex ea commodo <a href="http://www.google.co.uk" title="Google Search Engine">consequat</a>. Duis aute irure dolor in reprehenderit in <code>voluptate</code> velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat <small>cupidatat</small> non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p>
<h3>Section Title</h3>
<p>Lorem <em>ipsum</em> dolor sit amet, consectetur adipisicing elit, sed do <q>eiusmod</q> tempor incididunt ut labore et dolore magna <b>aliqua</b>. Ut enim ad minim veniam, quis nostrud <i>exercitation</i> ullamco laboris nisi ut aliquip ex ea commodo <a href="http://www.google.co.uk" title="Google Search Engine">consequat</a>. Duis aute irure dolor in reprehenderit in <code>voluptate</code> velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat <small>cupidatat</small> non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p>
<h2>Chapter Title</h2>
<p>Lorem <em>ipsum</em> dolor sit amet, consectetur adipisicing elit, sed do <q>eiusmod</q> tempor incididunt ut labore et dolore magna <b>aliqua</b>. Ut enim ad minim veniam, quis nostrud <i>exercitation</i> ullamco laboris nisi ut aliquip ex ea commodo <a href="http://www.google.co.uk" title="Google Search Engine">consequat</a>. Duis aute irure dolor in reprehenderit in <code>voluptate</code> velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat <small>cupidatat</small> non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p>
<h3>Section Title</h3>
<p>Lorem <em>ipsum</em> dolor sit amet, consectetur adipisicing elit, sed do <q>eiusmod</q> tempor incididunt ut labore et dolore magna <b>aliqua</b>. Ut enim ad minim veniam, quis nostrud <i>exercitation</i> ullamco laboris nisi ut aliquip ex ea commodo <a href="http://www.google.co.uk" title="Google Search Engine">consequat</a>. Duis aute irure dolor in reprehenderit in <code>voluptate</code> velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat <small>cupidatat</small> non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p>
<h3>Section Title</h3>
<p>Lorem <em>ipsum</em> dolor sit amet, consectetur adipisicing elit, sed do <q>eiusmod</q> tempor incididunt ut labore et dolore magna <b>aliqua</b>. Ut enim ad minim veniam, quis nostrud <i>exercitation</i> ullamco laboris nisi ut aliquip ex ea commodo <a href="http://www.google.co.uk" title="Google Search Engine">consequat</a>. Duis aute irure dolor in reprehenderit in <code>voluptate</code> velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat <small>cupidatat</small> non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p>

...which produces this HTML document:

(The rendered example page is embedded in the original article and is not reproduced here.)

Note the strong, clear header structure: a single h1 for the document title, an h2 for each chapter and an h3 for each section.
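To see that structure at a glance, the headings can be pulled out of the markup and printed as an indented outline. The following is only a minimal sketch in Python using the standard `html.parser` module; the class name `HeadingOutline`, the helper `outline_of` and the shortened `sample` string are made up for illustration and are not part of the conversion script.

```python
from html.parser import HTMLParser


class HeadingOutline(HTMLParser):
    """Collect (level, text) pairs for every h1-h6 element."""

    def __init__(self):
        super().__init__()
        self.outline = []    # finished (level, text) pairs
        self._level = None   # level of the heading currently open, if any
        self._text = []      # text fragments of the open heading

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased; "hr" is excluded by the digit check.
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self._level = int(tag[1])
            self._text = []

    def handle_data(self, data):
        if self._level is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._level is not None and tag == f"h{self._level}":
            self.outline.append((self._level, "".join(self._text).strip()))
            self._level = None


def outline_of(html):
    """Return the list of (level, text) headings found in an HTML string."""
    parser = HeadingOutline()
    parser.feed(html)
    parser.close()
    return parser.outline


# Hypothetical, shortened version of the example page
sample = (
    "<h1>Document Title</h1><p>Licence text.</p>"
    "<h2>Chapter Title</h2><p>Lorem ipsum.</p>"
    "<h3>Section Title</h3><p>Lorem ipsum.</p>"
)

# Indent each heading by two spaces per level below h1
for level, text in outline_of(sample):
    print("  " * (level - 1) + text)
```

A document whose outline never skips a level (h1, then h2, then h3) maps cleanly onto a title, chapters and sections.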

The HTML is converted to LaTeX via simple string-replacement functions in PHP, which produce this LaTeX source code:

% To create a PDF, run: pdflatex THIS_FILENAME.tex. Run the command twice so that the contents pages can be created.

% Declarations
\documentclass[a4paper,11pt]{book} % Basic style
\usepackage[a4paper,top=2cm,bottom=2cm,left=2cm,right=2cm]{geometry} % Edit margins
\usepackage[bottom]{footmisc} % Sort out footers
\usepackage{graphicx} % Include images
\usepackage{color} % Enable coloured text
\usepackage{multicol} % Enable columns
\usepackage[T1]{fontenc} % Get the \textquotedbl{} entity
\usepackage{textcomp} % Get the \textquotesingle{} entity
\usepackage{lmodern} % Make the fonts pretty again
\usepackage[pdftex]{hyperref} % Enable hyperlinks
\hypersetup{
    pdftitle={An Example Page},
    pdfauthor={Thomas Mark Guymer},
    pdfsubject={Tutorial},
    colorlinks}

% Document
\begin{document}

    % Title
    \title{An Example Page}
    \author{Thomas Guymer}
    \date{\today}
    \maketitle
    \thispagestyle{empty}

    % Contents
    \pagestyle{empty}
    \tableofcontents

    % Style
    \pagestyle{headings}
    \raggedbottom

    % Content
    \chapter*{LICENCE}

    \begin{center}
        \includegraphics[width=43.349mm]{../../graphics/creativeCommons.png}
    \end{center}

    This work is licensed under the \href{http://creativecommons.org/licenses/by-nc/2.0/uk/}{Creative Commons Attribution Non-Commercial 2.0 UK: England \& Wales Licence}\footnote{http://creativecommons.org/licenses/by-nc/2.0/uk/}. This licence is provided by \href{http://www.creativecommons.org/}{Creative Commons}\footnote{http://www.creativecommons.org/}. This means that \textbf{you are free} to: copy; distribute; and modify this work. It also means that \textbf{you cannot} use it for commercial purposes. Additionally, you must attribute this work to the original author, \emph{\href{http://www.thomasguymer.co.uk/who/}{Thomas Guymer}\footnote{http://www.thomasguymer.co.uk/who/}}. As this licence requires attribution \textbf{it is not compatible with any of the \href{http://www.gnu.org/licenses/licenses.html}{GNU Licenses}\footnote{http://www.gnu.org/licenses/licenses.html}}.

    \chapter{Chapter Title}

    Lorem \emph{ipsum} dolor sit amet, consectetur adipisicing elit, sed do \textquotedblleft{}\textsl{eiusmod}\textquotedblright{} tempor incididunt ut labore et dolore magna \textbf{aliqua}. Ut enim ad minim veniam, quis nostrud \textit{exercitation} ullamco laboris nisi ut aliquip ex ea commodo \href{http://www.google.co.uk}{consequat}\footnote{http://www.google.co.uk}. Duis aute irure dolor in reprehenderit in \texttt{voluptate} velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat {\small cupidatat} non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

    \section{Section Title}

    Lorem \emph{ipsum} dolor sit amet, consectetur adipisicing elit, sed do \textquotedblleft{}\textsl{eiusmod}\textquotedblright{} tempor incididunt ut labore et dolore magna \textbf{aliqua}. Ut enim ad minim veniam, quis nostrud \textit{exercitation} ullamco laboris nisi ut aliquip ex ea commodo \href{http://www.google.co.uk}{consequat}\footnote{http://www.google.co.uk}. Duis aute irure dolor in reprehenderit in \texttt{voluptate} velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat {\small cupidatat} non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

    \section{Section Title}

    Lorem \emph{ipsum} dolor sit amet, consectetur adipisicing elit, sed do \textquotedblleft{}\textsl{eiusmod}\textquotedblright{} tempor incididunt ut labore et dolore magna \textbf{aliqua}. Ut enim ad minim veniam, quis nostrud \textit{exercitation} ullamco laboris nisi ut aliquip ex ea commodo \href{http://www.google.co.uk}{consequat}\footnote{http://www.google.co.uk}. Duis aute irure dolor in reprehenderit in \texttt{voluptate} velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat {\small cupidatat} non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

    \chapter{Chapter Title}

    Lorem \emph{ipsum} dolor sit amet, consectetur adipisicing elit, sed do \textquotedblleft{}\textsl{eiusmod}\textquotedblright{} tempor incididunt ut labore et dolore magna \textbf{aliqua}. Ut enim ad minim veniam, quis nostrud \textit{exercitation} ullamco laboris nisi ut aliquip ex ea commodo \href{http://www.google.co.uk}{consequat}\footnote{http://www.google.co.uk}. Duis aute irure dolor in reprehenderit in \texttt{voluptate} velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat {\small cupidatat} non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

    \section{Section Title}

    Lorem \emph{ipsum} dolor sit amet, consectetur adipisicing elit, sed do \textquotedblleft{}\textsl{eiusmod}\textquotedblright{} tempor incididunt ut labore et dolore magna \textbf{aliqua}. Ut enim ad minim veniam, quis nostrud \textit{exercitation} ullamco laboris nisi ut aliquip ex ea commodo \href{http://www.google.co.uk}{consequat}\footnote{http://www.google.co.uk}. Duis aute irure dolor in reprehenderit in \texttt{voluptate} velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat {\small cupidatat} non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

    \section{Section Title}

    Lorem \emph{ipsum} dolor sit amet, consectetur adipisicing elit, sed do \textquotedblleft{}\textsl{eiusmod}\textquotedblright{} tempor incididunt ut labore et dolore magna \textbf{aliqua}. Ut enim ad minim veniam, quis nostrud \textit{exercitation} ullamco laboris nisi ut aliquip ex ea commodo \href{http://www.google.co.uk}{consequat}\footnote{http://www.google.co.uk}. Duis aute irure dolor in reprehenderit in \texttt{voluptate} velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat {\small cupidatat} non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

    \setcounter{footnote}{0}

    \chapter*{Appendix I: List of Abbreviations}

    \markboth{}{}

    Below is a table of the abbreviations used within this document.

    \begin{center}
        \begin{tabular}{ll}
            GNU&GNU\textquoteright{}s Not Unix\\
            HTTP&HyperText Transfer Protocol\\
            PDF&Portable Document Format\\
            PHP&PHP: Hypertext Preprocessor\\
            XHTML&eXtensible HyperText Markup Language\\
        \end{tabular}
    \end{center}

    \setcounter{footnote}{0}

    \chapter*{Appendix II: Author\textquoteright{}s Note}

    \markboth{}{}

    This document was created by converting the XHTML into a \LaTeX{} file using a custom PHP script. The \LaTeX{} file is then converted into this PDF via the \texttt{pdflatex} console command. HTTP links have been highlighted for ease of identification. These links should be clickable in any PDF viewer. I have also added footnotes containing the link\textquoteright{}s target to aid you if you\textquoteright{}ve printed off this PDF.

    If you have any comments or corrections then please \href{http://www.thomasguymer.co.uk/contact/}{contact me}\footnote{http://www.thomasguymer.co.uk/contact/}, thank you.

\end{document}

...which is then converted into a PDF via the Unix command "pdflatex example.tex" (run twice, so that the contents pages are built), producing this output:

(The resulting PDF is embedded in the original article and is not reproduced here.)
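The comment at the top of the LaTeX source says the command has to be run twice: the first pass writes the contents data to a `.toc` file, and the second pass reads it back in. A minimal sketch of automating that from Python follows; the function names `pdflatex_commands` and `build_pdf` are made up for illustration, and it assumes `pdflatex` is installed and on the PATH.

```python
import shutil
import subprocess
from pathlib import Path


def pdflatex_commands(tex_file, passes=2):
    """Return the argument lists for `passes` consecutive pdflatex runs."""
    argv = ["pdflatex", "-interaction=nonstopmode", Path(tex_file).name]
    return [list(argv) for _ in range(passes)]


def build_pdf(tex_file, passes=2):
    """Run pdflatex in the .tex file's own directory; return the PDF path."""
    tex_path = Path(tex_file).resolve()
    if shutil.which("pdflatex") is None:
        raise RuntimeError("pdflatex was not found on PATH")
    for argv in pdflatex_commands(tex_path, passes):
        # The working directory matters: relative \includegraphics paths
        # in the document are resolved from it.
        subprocess.run(argv, cwd=tex_path.parent, check=True)
    return tex_path.with_suffix(".pdf")
```

Calling `build_pdf("example.tex")` would then run both passes and return the path of `example.pdf`. The `-interaction=nonstopmode` flag stops pdflatex from waiting for keyboard input if it hits an error.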

Simple, eh?
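To give a flavour of the string-replacement idea, here is a minimal sketch in Python rather than PHP. It is not the unreleased script; the replacement tables and the names `TAG_REPLACEMENTS`, `ENTITY_REPLACEMENTS`, `link_to_latex` and `html_to_latex` are assumptions. The LaTeX on the right-hand side of each pair is copied from the example output above, including the footnote that repeats each link's target.

```python
import re

# Plain tag-for-markup swaps. The LaTeX fragments mirror the example output;
# the table itself is an illustrative assumption.
TAG_REPLACEMENTS = [
    ("<em>", r"\emph{"), ("</em>", "}"),
    ("<b>", r"\textbf{"), ("</b>", "}"),
    ("<i>", r"\textit{"), ("</i>", "}"),
    ("<code>", r"\texttt{"), ("</code>", "}"),
    ("<small>", r"{\small "), ("</small>", "}"),
    ("<q>", r"\textquotedblleft{}\textsl{"), ("</q>", r"}\textquotedblright{}"),
    ("<h2>", r"\chapter{"), ("</h2>", "}\n\n"),
    ("<h3>", r"\section{"), ("</h3>", "}\n\n"),
    ("<p>", ""), ("</p>", "\n\n"),
]

# Entities that appear in the example page
ENTITY_REPLACEMENTS = [
    ("&amp;", r"\&"),
    ("&apos;", r"\textquoteright{}"),
]

# Matches <a href="..." ...>text</a>; any other attributes are discarded
LINK = re.compile(r'<a\s+href="([^"]*)"[^>]*>(.*?)</a>', re.S)


def link_to_latex(match):
    """Turn one link into \\href{url}{text} plus a footnote with the URL."""
    url, text = match.group(1), match.group(2)
    return rf"\href{{{url}}}{{{text}}}\footnote{{{url}}}"


def html_to_latex(html):
    """Convert a small subset of HTML to LaTeX by string replacement."""
    latex = LINK.sub(link_to_latex, html)  # links first, while tags are intact
    for old, new in TAG_REPLACEMENTS + ENTITY_REPLACEMENTS:
        latex = latex.replace(old, new)
    return latex.strip()


sample = (
    "<h3>Section Title</h3>"
    "<p>Lorem <em>ipsum</em> sed do <q>eiusmod</q> magna <b>aliqua</b>, "
    'commodo <a href="http://www.google.co.uk" '
    'title="Google Search Engine">consequat</a>.</p>'
)
print(html_to_latex(sample))
```

The sketch ignores plenty that a real converter has to deal with: tags carrying attributes (such as the `class="attention"` paragraph), relative links that need the site's address in front of them, images, and characters such as `%`, `#` and `_` that would need escaping.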

Now that I've given you this head start, you just have to write your own handling for tables, lists and source code. It took me a while to do it all, but I believe that the results are worth it.
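As a starting point for the lists, the same replacement trick can map HTML list tags onto LaTeX list environments. This is a hypothetical sketch in Python, not taken from the real converter; the names `LIST_REPLACEMENTS` and `lists_to_latex` are made up, and it assumes the list tags carry no attributes.

```python
# Each HTML list tag is swapped for the matching LaTeX construct.
LIST_REPLACEMENTS = [
    ("<ul>", "\\begin{itemize}\n"), ("</ul>", "\\end{itemize}\n"),
    ("<ol>", "\\begin{enumerate}\n"), ("</ol>", "\\end{enumerate}\n"),
    ("<li>", "    \\item "), ("</li>", "\n"),
]


def lists_to_latex(html):
    """Swap list tags for LaTeX itemize/enumerate environments."""
    latex = html
    for old, new in LIST_REPLACEMENTS:
        latex = latex.replace(old, new)
    return latex


sample = "<ul><li>tables</li><li>lists</li><li>source code</li></ul>"
print(lists_to_latex(sample))
```

Because LaTeX list environments nest just as HTML lists do, a nested list needs no special handling with this approach. Tables are harder, since the column specification has to be worked out before the `tabular` environment can be opened.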

This page was last modified on 02/11/2008.


© 2002-2010 Thomas Guymer. See the Copyright Statement & the Cookie Policy.