HTML to PDF
This work is licensed under the Creative Commons Attribution Non-Commercial 2.0 UK: England & Wales Licence. This means that you are free to: copy; distribute; and modify this work. It also means that you cannot use it for commercial purposes. Additionally, you must attribute this work to the original author, Thomas Guymer, ideally with a link.
This is just a brief walk-through. I am not going to show you the full code, as it took me ages to fine-tune and I'm not ready to release it yet. Consider this a 'proof of concept': it shows a few stages in the process I use to convert HTML to PDF.
First, consider this HTML source code:
<p class="attention"><img src="/graphics/creativeCommons.png" alt="Creative Commons Logo" width="256" height="60" />This work is licensed under the <a href="http://creativecommons.org/licenses/by-nc/2.0/uk/" title="Creative Commons Attribution Non-Commercial 2.0 UK: England &amp; Wales Licence" class="external">Creative Commons Attribution Non-Commercial 2.0 UK: England &amp; Wales Licence</a>. This licence is provided by <a href="http://www.creativecommons.org/" title="Creative Commons" class="external">Creative Commons</a>. This means that <b>you are free</b> to: copy; distribute; and modify this work. It also means that <b>you cannot</b> use it for commercial purposes. Additionally, you must attribute this work to the original author, <em><a href="/who/" title="Who is Thomas Guymer?">Thomas Guymer</a></em>. As this licence requires attribution <b>it is not compatible with any of the <a href="http://www.gnu.org/licenses/licenses.html" title="GNU Licenses" class="external"><abbr title="GNU's Not Unix">GNU</abbr> Licenses</a></b>.</p>
<h2>Chapter Title</h2>
<p>Lorem <em>ipsum</em> dolor sit amet, consectetur adipisicing elit, sed do <q>eiusmod</q> tempor incididunt ut labore et dolore magna <b>aliqua</b>. Ut enim ad minim veniam, quis nostrud <i>exercitation</i> ullamco laboris nisi ut aliquip ex ea commodo <a href="http://www.google.co.uk" title="Google Search Engine">consequat</a>. Duis aute irure dolor in reprehenderit in <code>voluptate</code> velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat <small>cupidatat</small> non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p>
<h3>Section Title</h3>
<p>Lorem <em>ipsum</em> dolor sit amet, consectetur adipisicing elit, sed do <q>eiusmod</q> tempor incididunt ut labore et dolore magna <b>aliqua</b>. Ut enim ad minim veniam, quis nostrud <i>exercitation</i> ullamco laboris nisi ut aliquip ex ea commodo <a href="http://www.google.co.uk" title="Google Search Engine">consequat</a>. Duis aute irure dolor in reprehenderit in <code>voluptate</code> velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat <small>cupidatat</small> non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p>
<h3>Section Title</h3>
<p>Lorem <em>ipsum</em> dolor sit amet, consectetur adipisicing elit, sed do <q>eiusmod</q> tempor incididunt ut labore et dolore magna <b>aliqua</b>. Ut enim ad minim veniam, quis nostrud <i>exercitation</i> ullamco laboris nisi ut aliquip ex ea commodo <a href="http://www.google.co.uk" title="Google Search Engine">consequat</a>. Duis aute irure dolor in reprehenderit in <code>voluptate</code> velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat <small>cupidatat</small> non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p>
<h2>Chapter Title</h2>
<p>Lorem <em>ipsum</em> dolor sit amet, consectetur adipisicing elit, sed do <q>eiusmod</q> tempor incididunt ut labore et dolore magna <b>aliqua</b>. Ut enim ad minim veniam, quis nostrud <i>exercitation</i> ullamco laboris nisi ut aliquip ex ea commodo <a href="http://www.google.co.uk" title="Google Search Engine">consequat</a>. Duis aute irure dolor in reprehenderit in <code>voluptate</code> velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat <small>cupidatat</small> non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p>
<h3>Section Title</h3>
<p>Lorem <em>ipsum</em> dolor sit amet, consectetur adipisicing elit, sed do <q>eiusmod</q> tempor incididunt ut labore et dolore magna <b>aliqua</b>. Ut enim ad minim veniam, quis nostrud <i>exercitation</i> ullamco laboris nisi ut aliquip ex ea commodo <a href="http://www.google.co.uk" title="Google Search Engine">consequat</a>. Duis aute irure dolor in reprehenderit in <code>voluptate</code> velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat <small>cupidatat</small> non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p>
<h3>Section Title</h3>
<p>Lorem <em>ipsum</em> dolor sit amet, consectetur adipisicing elit, sed do <q>eiusmod</q> tempor incididunt ut labore et dolore magna <b>aliqua</b>. Ut enim ad minim veniam, quis nostrud <i>exercitation</i> ullamco laboris nisi ut aliquip ex ea commodo <a href="http://www.google.co.uk" title="Google Search Engine">consequat</a>. Duis aute irure dolor in reprehenderit in <code>voluptate</code> velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat <small>cupidatat</small> non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p>
...which produces this HTML document:
Note the strong, clear heading structure:
- Document Title
  - Chapter Title
    - Section Title
    - Section Title
  - Chapter Title
    - Section Title
    - Section Title
  - Chapter Title
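That heading structure is what drives the LaTeX sectioning. As a rough sketch only (written in Python, not the PHP used for the real converter, which is not shown on this page), the heading step could look something like the block below. The mapping of <h2> to \chapter and <h3> to \section is read off the example on this page; the names HEADING_COMMANDS and convert_headings are made up for illustration.

```python
import re

# Assumed mapping, read off the example on this page:
# <h2> becomes a chapter and <h3> becomes a section.
HEADING_COMMANDS = {"h2": "chapter", "h3": "section"}


def convert_headings(html):
    """Replace <h2>/<h3> elements with \\chapter{...}/\\section{...}."""

    def replace(match):
        command = HEADING_COMMANDS[match.group(1)]
        return "\\" + command + "{" + match.group(2) + "}"

    # \1 makes sure the closing tag matches the opening tag.
    return re.sub(r"<(h2|h3)>(.*?)</\1>", replace, html)


print(convert_headings("<h2>Chapter Title</h2>"))  # \chapter{Chapter Title}
print(convert_headings("<h3>Section Title</h3>"))  # \section{Section Title}
```

This only copes with bare heading tags on a single line; headings carrying attributes or nested markup would need more care.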
The HTML is converted to LaTeX via simple string-replacement functions in PHP, which produce this LaTeX source code:
% Declarations
\documentclass[a4paper,11pt]{book} % Basic style
\usepackage[a4paper,top=2cm,bottom=2cm,left=2cm,right=2cm]{geometry} % Edit margins
\usepackage[bottom]{footmisc} % Sort out footers
\usepackage{graphicx} % Include images
\usepackage{color} % Enable coloured text
\usepackage{multicol} % Enable columns
\usepackage[T1]{fontenc} % Get the \textquotedbl{} entity
\usepackage{textcomp} % Get the \textquotesingle{} entity
\usepackage{lmodern} % Make the fonts pretty again
\usepackage[pdftex]{hyperref} % Enable hyperlinks
\hypersetup{
pdftitle={An Example Page},
pdfauthor={Thomas Mark Guymer},
pdfsubject={Tutorial},
colorlinks}
% Document
\begin{document}
% Title
\title{An Example Page}
\author{Thomas Guymer}
\date{\today}
\maketitle
\thispagestyle{empty}
% Contents
\pagestyle{empty}
\tableofcontents
% Style
\pagestyle{headings}
\raggedbottom
% Content
\chapter*{LICENCE}
\begin{center}
\includegraphics[width=43.349mm]{../../graphics/creativeCommons.png}
\end{center}
This work is licensed under the \href{http://creativecommons.org/licenses/by-nc/2.0/uk/}{Creative Commons Attribution Non-Commercial 2.0 UK: England \& Wales Licence}\footnote{http://creativecommons.org/licenses/by-nc/2.0/uk/}. This licence is provided by \href{http://www.creativecommons.org/}{Creative Commons}\footnote{http://www.creativecommons.org/}. This means that \textbf{you are free} to: copy; distribute; and modify this work. It also means that \textbf{you cannot} use it for commercial purposes. Additionally, you must attribute this work to the original author, \emph{\href{http://www.thomasguymer.co.uk/who/}{Thomas Guymer}\footnote{http://www.thomasguymer.co.uk/who/}}. As this licence requires attribution \textbf{it is not compatible with any of the \href{http://www.gnu.org/licenses/licenses.html}{GNU Licenses}\footnote{http://www.gnu.org/licenses/licenses.html}}.
\chapter{Chapter Title}
Lorem \emph{ipsum} dolor sit amet, consectetur adipisicing elit, sed do \textquotedblleft{}\textsl{eiusmod}\textquotedblright{} tempor incididunt ut labore et dolore magna \textbf{aliqua}. Ut enim ad minim veniam, quis nostrud \textit{exercitation} ullamco laboris nisi ut aliquip ex ea commodo \href{http://www.google.co.uk}{consequat}\footnote{http://www.google.co.uk}. Duis aute irure dolor in reprehenderit in \texttt{voluptate} velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat {\small cupidatat} non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
\section{Section Title}
Lorem \emph{ipsum} dolor sit amet, consectetur adipisicing elit, sed do \textquotedblleft{}\textsl{eiusmod}\textquotedblright{} tempor incididunt ut labore et dolore magna \textbf{aliqua}. Ut enim ad minim veniam, quis nostrud \textit{exercitation} ullamco laboris nisi ut aliquip ex ea commodo \href{http://www.google.co.uk}{consequat}\footnote{http://www.google.co.uk}. Duis aute irure dolor in reprehenderit in \texttt{voluptate} velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat {\small cupidatat} non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
\section{Section Title}
Lorem \emph{ipsum} dolor sit amet, consectetur adipisicing elit, sed do \textquotedblleft{}\textsl{eiusmod}\textquotedblright{} tempor incididunt ut labore et dolore magna \textbf{aliqua}. Ut enim ad minim veniam, quis nostrud \textit{exercitation} ullamco laboris nisi ut aliquip ex ea commodo \href{http://www.google.co.uk}{consequat}\footnote{http://www.google.co.uk}. Duis aute irure dolor in reprehenderit in \texttt{voluptate} velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat {\small cupidatat} non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
\chapter{Chapter Title}
Lorem \emph{ipsum} dolor sit amet, consectetur adipisicing elit, sed do \textquotedblleft{}\textsl{eiusmod}\textquotedblright{} tempor incididunt ut labore et dolore magna \textbf{aliqua}. Ut enim ad minim veniam, quis nostrud \textit{exercitation} ullamco laboris nisi ut aliquip ex ea commodo \href{http://www.google.co.uk}{consequat}\footnote{http://www.google.co.uk}. Duis aute irure dolor in reprehenderit in \texttt{voluptate} velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat {\small cupidatat} non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
\section{Section Title}
Lorem \emph{ipsum} dolor sit amet, consectetur adipisicing elit, sed do \textquotedblleft{}\textsl{eiusmod}\textquotedblright{} tempor incididunt ut labore et dolore magna \textbf{aliqua}. Ut enim ad minim veniam, quis nostrud \textit{exercitation} ullamco laboris nisi ut aliquip ex ea commodo \href{http://www.google.co.uk}{consequat}\footnote{http://www.google.co.uk}. Duis aute irure dolor in reprehenderit in \texttt{voluptate} velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat {\small cupidatat} non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
\section{Section Title}
Lorem \emph{ipsum} dolor sit amet, consectetur adipisicing elit, sed do \textquotedblleft{}\textsl{eiusmod}\textquotedblright{} tempor incididunt ut labore et dolore magna \textbf{aliqua}. Ut enim ad minim veniam, quis nostrud \textit{exercitation} ullamco laboris nisi ut aliquip ex ea commodo \href{http://www.google.co.uk}{consequat}\footnote{http://www.google.co.uk}. Duis aute irure dolor in reprehenderit in \texttt{voluptate} velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat {\small cupidatat} non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
\setcounter{footnote}{0}
\chapter*{Appendix I: List of Abbreviations}
\markboth{}{}
Below is a table of the abbreviations used within this document.
\begin{center}
\begin{tabular}{ll}
GNU&GNU\textquoteright{}s Not Unix\\
HTTP&HyperText Transfer Protocol\\
PDF&Portable Document Format\\
PHP&PHP: Hypertext Preprocessor\\
XHTML&eXtensible HyperText Markup Language\\
\end{tabular}
\end{center}
\setcounter{footnote}{0}
\chapter*{Appendix II: Author\textquoteright{}s Note}
\markboth{}{}
This document was created by converting the XHTML into a \LaTeX{} file using a custom PHP script. The \LaTeX{} file is then converted into this PDF via the \texttt{pdflatex} console command. HTTP links have been highlighted for ease of identification. These links should be clickable in any PDF viewer. I have also added footnotes containing the link\textquoteright{}s target to aid you if you\textquoteright{}ve printed off this PDF.
If you have any comments or corrections then please \href{http://www.thomasguymer.co.uk/contact/}{contact me}\footnote{http://www.thomasguymer.co.uk/contact/}, thank you.
\end{document}
...which is then converted into PDF via the Unix command "pdflatex example.tex" (run it twice so that the table of contents is filled in), producing this output:
Simple, eh?
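If you want to script the pdflatex step, one possible way is sketched below in Python. It assumes pdflatex is on your PATH; the function names are made up for illustration, and the default of two passes is there because \tableofcontents is only filled in on a second run.

```python
import subprocess
from pathlib import Path


def pdflatex_command(tex_name):
    """Build the pdflatex command line for one pass."""
    # -interaction=nonstopmode stops pdflatex pausing for input on errors.
    return ["pdflatex", "-interaction=nonstopmode", tex_name]


def build_pdf(tex_file, passes=2):
    """Run pdflatex in the .tex file's own directory and return the PDF path."""
    tex_path = Path(tex_file)
    for _ in range(passes):
        # Running inside the file's directory keeps relative paths working,
        # such as the \includegraphics path in the example.
        subprocess.run(
            pdflatex_command(tex_path.name),
            cwd=tex_path.parent,
            check=True,
        )
    return tex_path.with_suffix(".pdf")
```

Calling build_pdf("example.tex") runs the command twice and returns the path example.pdf; check=True raises an exception if pdflatex exits with an error.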
Now that I've given you this head start, you just have to write your own handling of tables, lists and source code. It took me a while to do it all, but I believe that the results are worth it.
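As a starting point for the inline markup, here is a minimal sketch of the string-replacement idea in Python. It is not the PHP converter described on this page, which has not been released. The tag-to-LaTeX pairs are read off the example output above, and the names SIMPLE_REPLACEMENTS, LINK, link_to_latex and convert_inline are hypothetical.

```python
import re

# Fixed tag-to-LaTeX pairs, read off the example output on this page.
SIMPLE_REPLACEMENTS = [
    ("<em>", "\\emph{"), ("</em>", "}"),
    ("<b>", "\\textbf{"), ("</b>", "}"),
    ("<i>", "\\textit{"), ("</i>", "}"),
    ("<code>", "\\texttt{"), ("</code>", "}"),
    ("<small>", "{\\small "), ("</small>", "}"),
    ("<q>", "\\textquotedblleft{}\\textsl{"),
    ("</q>", "}\\textquotedblright{}"),
    ("<p>", ""), ("</p>", ""),
    ("&amp;", "\\&"),
]

# Only absolute http(s) links are handled; relative links such as "/who/"
# are left untouched by this sketch.
LINK = re.compile(r'<a href="(http[^"]*)"[^>]*>(.*?)</a>')


def link_to_latex(match):
    """Turn a link into \\href plus a footnote holding the target URL."""
    url, text = match.group(1), match.group(2)
    return "\\href{" + url + "}{" + text + "}\\footnote{" + url + "}"


def convert_inline(html):
    """Convert links first, then apply the plain string replacements."""
    latex = LINK.sub(link_to_latex, html)
    for old, new in SIMPLE_REPLACEMENTS:
        latex = latex.replace(old, new)
    return latex


sample = (
    '<p>Lorem <em>ipsum</em> <a href="http://www.google.co.uk" '
    'title="Google Search Engine">consequat</a>.</p>'
)
print(convert_inline(sample))
```

The sketch does not escape other LaTeX special characters (%, _, # and so on), and it does nothing for tables, lists or images, so treat it as an illustration of the approach and not a working converter.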
