Introduction to texor

tools for converting LaTeX source files into RJ-web-articles

Introduction

texor is a proposed package that can help in converting old LaTeX based documents, research papers to HTML through intermediate conversions. This is particularly a problem for R Research papers where HTML export was not available and hence modern compatibility to export a HTML file was missed out. To bring these documents online HTML based webpages are a much better alternative as opposed to PDFs and to solve the lack of web versions texor will provide a mechanism.

Conversion Workflow:

Stage 0 : preparing to convert

In this stage, we will check the basics like using correct path, normalizing the path,extracting the file_name/ wrapper_name etc..

# normalizing path using xfun package
dir <- xfun::normalize_path(dir)

# getting wrapper file name
wrapper_file <- texor::get_wrapper_type(dir)

# getting the main LaTeX file name
file_name <- texor::get_texfile_name(dir)

Stage 1 : Copying/Removing Style files

Pandoc does not need, all of the style files as it is not trying to compile, but rather convert. Hence, to workaround certain limitations, we have to remove the RJournal.sty file and include a new style file which redefines certain commands.

# This function will remove RJournal.sty file,
# Copy the Metafix.sty file and link it in wrapper.
texor::include_style_file(dir)

Stage 2 : Handling Bibliographies

As we do not desire the embedded bibliography to be included as a div element in the article itself, we need to convert it to Bibtex format.

For removing the bibliography div elements from the article we use a Lua filter later on.

For converting the embedded bibliography we use rebib package. By default I have set up the bibliography aggregation function, which will logically create/update the bibtex file and include it in the article_tex_file as well (if not linked).

# bibliography aggregation when both bibtex and embedded bibliography available,
# Using Bibtex file for bibliography if no embedded bibliography available,
# Create a new bibtex file using the embedded bibliography in the document.
rebib::aggregate_bibliography(dir)

Stage 3 : Handing Figures

Texor package creates a yaml report about the figure environments, including tikz, algorithm2e images. There is also a logical function which uses pandoc’s Image data for converting PDF images to PNG.

data <- texor::handle_figures(dir, file_name)

Stage 4 : Patching Environments

Pandoc does not support certain environments, like:
in figures : figure*, algorithmic, algorithm. in table : table*.
in code : example, example*, Sin, Sout, Scode, Sinput, Soutput, smallverbatim, boxedverbatim.

Here, texor will use the stream editor to patch these environments to the default types figure,table and verbatim.

There is also a function to patch equations (especially eqnarray environment).

texor::patch_code_env(dir)
texor::patch_table_env(dir)
texor::patch_figure_env(dir)
texor::patch_equations(dir)

Stage 5 : Converting the LaTeX document to Markdown

Here we will convert the document to Markdown, with a lot of Lua filters modifying the document.

pre_conversion statistics will read through the article and find statistical elements like figure, table, code blocks, citations etc.

meta <- texor::pre_conversion_statistics(dir)
texor::convert_to_markdown(dir)

Stage 6 : Copying over dependency files

This function will copy the files such as figures of all kinds, bibtex file, pdfs etc. to the /web folder.

texor::copy_other_files(dir)

Stage 7 : Converting the Markdown to Rmarkdown

In this stage we convert the markdown to Rmarkdown by reading and adding metadata information like ctv,CRANpkgs,BIOpkgs,slug,author metadata, title, abstract,etc..

We also add important parameters for rjtools::rjournal_web_article like:

  1. self_contained = TRUE
  2. toc = FALSE
  3. legacy_pdf = TRUE
  4. web_only = TRUE
  5. mathjax = “https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-chtml-full.js
texor::generate_rmd(dir)

Stage 8 : Creating RJ-web-article from the Rmarkdown file

texor::produce_html(dir)

Note on Dependencies

This package is involved with solving conversion problems on multiple fronts, thus has to rely on multiple software tools. A list with reasoning is included here:

  1. Pandoc (min: V3.1) 1 : To convert the basic LaTeX to rmarkdown.
  2. poppler pdfutils : To convert embedded PDF to PNG.
  3. LaTeX/TinyTex2 : For Tikz and PDF article generation.
  4. rebib : For bibliography conversions/aggregations.
  5. rjtools : For the rjournal_web_article template.
  6. stringr : for string manipulation functions.
  7. Lua filters 3 4 : Filters for manipulation of ingested files.

  1. included in Rstudio↩︎

  2. read more here↩︎

  3. Filters are included in texor package↩︎

  4. Filters as standalone are also available at this github repo↩︎