The CRAN package texor helps convert older LaTeX-based documents, such as research papers, to HTML through intermediate conversions. This was a particular problem for legacy R research papers, for which HTML export was not available and which therefore have no modern HTML version.
Web development has advanced a lot, and modern websites offer a much more interactive and accessible interface for the knowledge we consume. The advantages of a web format are:
To maintain parity with modern articles, we convert the legacy articles to the R Markdown format, a Markdown-based solution that publishes PDF and web content from a single source document and supports executable code chunks that reproduce the results at compile time.
We have many legacy articles that are available only in PDF format. To bring these LaTeX-based documents to a web format, we needed a conversion tool that could read LaTeX and generate a Markdown file. The solution exists in a beautiful piece of software written in Haskell called “Pandoc”: it is fast, portable and well integrated into the R ecosystem. However, there are limitations in the way Pandoc works with LaTeX articles; some of these are:
example environment, which is built on top of the verbatim environment, so we need to devise methods to replace these custom environments with simple alternatives.
Sounds like a lot of hassle just to convert a single LaTeX article to an R Markdown file, right? How nice it would be if the workarounds for most of these limitations were automated and did not have to be performed manually for each and every document. This was the exact thought when we developed the texor package and its sister package rebib. The package does all of the above and reduces the conversion process for the end user to a single function call.
If you are converting an R Journal LaTeX article:
or, if you are converting a Sweave article¹:
More customization options are available if you want things handled differently, but for the most part the default settings yield a reasonably good conversion to R Markdown, which can then be knitted to HTML.
This is the aim of the whole package: reducing complexity and automating repetitive tasks for a better conversion process.
A key point to note, however, is that not all documents will convert well, or at all. This is because LaTeX is very customizable and places few restrictions on authors.
To explain the internal conversion process in more depth, I have divided it into stages. The workflow described here is indicative only and may differ from the actual sequence due to updates.
In this stage, we check the basics: that the correct path is used, that the path is normalized, and that the file_name / wrapper_name and similar details are extracted.
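The checks in this stage can be sketched roughly as follows. This is a minimal Python illustration, not texor's actual implementation: the function name find_wrapper and the rule "the wrapper is the .tex file that contains \documentclass" are assumptions made for the example.

```python
from pathlib import Path


def find_wrapper(article_dir):
    """Normalize the article directory and return (directory, wrapper name).

    Hypothetical rule for this sketch: the wrapper is the first .tex file
    (in sorted order) whose text contains a \\documentclass declaration.
    """
    root = Path(article_dir).expanduser().resolve()  # normalize the path
    if not root.is_dir():
        raise NotADirectoryError(f"not a directory: {root}")
    for tex_file in sorted(root.glob("*.tex")):
        text = tex_file.read_text(encoding="utf-8", errors="replace")
        if "\\documentclass" in text:
            return root, tex_file.stem  # name without the .tex extension
    raise FileNotFoundError(f"no wrapper .tex file found in {root}")
```

Failing early on a wrong path keeps later stages from producing confusing errors halfway through a conversion.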
Pandoc does not need all of the style files, because it is converting the document rather than compiling it. To work around certain limitations, we therefore remove the RJournal.sty file and include a new style file that redefines certain commands.
As we do not want the embedded bibliography to be included as a div element in the article itself, we need to convert it to BibTeX format.
A Lua filter removes the bibliography div elements from the article later on.
To convert the embedded bibliography we use the rebib package. By default I have set up the bibliography aggregation function, which creates or updates the BibTeX file as needed and also includes it in the article_tex_file (if it is not already linked).
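The distinction between an embedded and a linked bibliography can be illustrated with a small Python sketch. It is not how rebib is implemented; the function name bibliography_kind and the choice to report "embedded" when both forms are present are assumptions for the example, and commented-out LaTeX lines are not handled.

```python
import re


def bibliography_kind(tex_source):
    """Classify how a LaTeX source provides its bibliography.

    Returns "embedded" for a thebibliography environment, "linked" for a
    \\bibliography{...} command, and "none" otherwise.
    """
    if re.search(r"\\begin\{thebibliography\}", tex_source):
        return "embedded"  # entries are written inline as \bibitem items
    if re.search(r"\\bibliography\{[^}]+\}", tex_source):
        return "linked"    # entries live in an external .bib file
    return "none"
```

Only the "embedded" case needs conversion to BibTeX; a "linked" article already points at a .bib file.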
The texor package creates a YAML report about the figure environments, including tikz and algorithm2e images. There is also a function that uses Pandoc’s image data to convert PDF images to PNG.
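As a rough idea of what such a report might contain, the Python sketch below counts figure-like environments and prints the counts as simple YAML. The report layout, the function name figure_report, and the environment names (tikzpicture for tikz, algorithm for algorithm2e) are assumptions for illustration; the real texor report may be structured differently.

```python
import re
from collections import Counter

# Environment names assumed for this sketch.
FIGURE_ENVS = ("figure", "figure*", "tikzpicture", "algorithm")


def figure_report(tex):
    """Count figure-like environments and format the counts as simple YAML."""
    names = re.findall(r"\\begin\{([^}]+)\}", tex)
    counts = Counter(name for name in names if name in FIGURE_ENVS)
    lines = ["figures:"]
    for env in FIGURE_ENVS:
        # Keys are quoted so that names such as figure* stay plain strings.
        lines.append(f'  "{env}": {counts[env]}')
    return "\n".join(lines) + "\n"
```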
Pandoc does not support certain environments, such as:
Figures: figure*, algorithmic, algorithm.
Tables: table*.
Code: example, example*, Sin, Sout, Scode, Sinput, Soutput, smallverbatim, boxedverbatim.
Here, texor uses the stream editor to patch these environments to the default types figure, table and verbatim.
There is also a function to patch equations (especially the eqnarray environment).
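The environment patching described above amounts to a search-and-replace over \begin{...} and \end{...} markers. The following Python sketch shows the idea using the environment names listed earlier; the ENV_MAP dictionary and the patch_environments function are illustrative names, and texor's own stream editor may work differently.

```python
import re

# Unsupported environment -> default type, following the groups listed above.
ENV_MAP = {
    "figure*": "figure", "algorithmic": "figure", "algorithm": "figure",
    "table*": "table",
    "example": "verbatim", "example*": "verbatim",
    "Sin": "verbatim", "Sout": "verbatim", "Scode": "verbatim",
    "Sinput": "verbatim", "Soutput": "verbatim",
    "smallverbatim": "verbatim", "boxedverbatim": "verbatim",
}

# Match \begin{name} or \end{name} for any name in ENV_MAP. The closing brace
# is part of the pattern, so "example" cannot match a prefix of "example*".
_ENV_PATTERN = re.compile(
    r"\\(begin|end)\{(" + "|".join(re.escape(name) for name in ENV_MAP) + r")\}"
)


def patch_environments(tex):
    """Rewrite unsupported environment names to figure, table or verbatim."""
    def replace(match):
        return "\\" + match.group(1) + "{" + ENV_MAP[match.group(2)] + "}"
    return _ENV_PATTERN.sub(replace, tex)
```

For example, patch_environments applied to a \begin{Sinput} ... \end{Sinput} block yields a plain verbatim block that Pandoc can read, while environments that are already supported are left untouched.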
Here we will convert the document to Markdown, with a lot of Lua filters modifying the document.
This function copies files such as figures of all kinds, the BibTeX file, PDFs, etc. to the /web folder.
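A minimal Python sketch of this copying step is shown below. The function name copy_to_web and the list of file extensions are assumptions for the example, and only top-level files are handled; figures stored in subdirectories would need a recursive walk.

```python
import shutil
from pathlib import Path

# File types to copy; this list is an assumption made for the sketch.
COPY_SUFFIXES = {".png", ".jpg", ".jpeg", ".pdf", ".bib"}


def copy_to_web(article_dir):
    """Copy figures, PDFs and BibTeX files from article_dir into article_dir/web."""
    root = Path(article_dir)
    web = root / "web"
    web.mkdir(exist_ok=True)  # create the output folder if it is missing
    copied = []
    for path in sorted(root.iterdir()):
        if path.is_file() and path.suffix.lower() in COPY_SUFFIXES:
            shutil.copy2(path, web / path.name)  # copy2 keeps file timestamps
            copied.append(path.name)
    return copied
```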
In this stage we convert the Markdown file to R Markdown by reading and adding metadata such as ctv, CRANpkgs, BIOpkgs, slug, author metadata, title, abstract, etc.
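Conceptually, this stage prepends a YAML front-matter block to the converted Markdown. The Python sketch below shows the mechanism only; the function name add_front_matter and the sample keys in the usage note are illustrative, and the metadata texor actually writes is richer than this.

```python
import json


def add_front_matter(markdown_body, metadata):
    """Prepend a YAML front-matter block built from a flat metadata dict.

    Values are written as JSON-style double-quoted strings, which are also
    valid YAML scalars; lists and tuples become YAML sequences.
    """
    lines = ["---"]
    for key, value in metadata.items():
        if isinstance(value, (list, tuple)):
            lines.append(f"{key}:")
            lines.extend(f"  - {json.dumps(str(item))}" for item in value)
        else:
            lines.append(f"{key}: {json.dumps(str(value))}")
    lines.append("---")
    return "\n".join(lines) + "\n\n" + markdown_body
```

Calling add_front_matter(body, {"title": "A Title", "CRANpkgs": ["texor", "rebib"]}) returns the body preceded by a front-matter block with a quoted title and a two-item CRANpkgs list.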
We also add important parameters for rjtools::rjournal_web_article, such as:
This package tackles multiple challenges and therefore relies on multiple software tools. A list of dependencies is included here: