Wednesday, 21 August 2013

Match LaTeX reserved characters with regex

I have an HTML to LaTeX parser tailored to what it's supposed to do
(convert snippets of HTML into snippets of LaTeX), but there is a little
issue with filling in variables. The issue is that variables should be
allowed to contain the LaTeX reserved characters (namely # $ % ^ & _ { } ~
\). These need to be escaped so that they won't kill our LaTeX renderer.
The program that handles the conversion and everything is written in
Python, so I tried to find a nice solution there. My first idea was to
simply use .replace(), but replace doesn't let you match a character only
when the character before it is not a \. My second attempt was a regex, but
I failed miserably at that.
The regex I came up with is ([^\][#\$%\^&_\{\}~\\]). I hoped that this
would match any of the reserved characters, but only if it didn't have a \
in front. Unfortunately, this matches every single character in my input
text. I've also tried different variations on this regex, but I can't get
it to work. The variations mainly consisted of removing/adding backslashes
in the second part of the regex.
Can anyone help with this regex?
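
One possible direction, as a minimal sketch rather than a definitive answer: in the regex above, \] is an escaped bracket, so the whole expression is a single negated character class that matches any character that is *not* reserved. A negative lookbehind, (?<!\\), expresses "not preceded by a \" instead. The names REPLACEMENTS, UNESCAPED and escape_latex below are made up for illustration. The sketch also assumes that a bare \ in the input is left alone (it is treated as an existing escape), and it uses \^{} and \~{} because \^ and \~ on their own are accent commands in LaTeX.

```python
import re

# Hypothetical mapping from each reserved character to its LaTeX escape.
# '^' and '~' get a trailing {} so they are not read as accent commands.
REPLACEMENTS = {
    "#": r"\#",
    "$": r"\$",
    "%": r"\%",
    "&": r"\&",
    "_": r"\_",
    "{": r"\{",
    "}": r"\}",
    "^": r"\^{}",
    "~": r"\~{}",
}

# (?<!\\) is a negative lookbehind: the character class only matches when
# the character immediately before it is not a backslash.
UNESCAPED = re.compile(r"(?<!\\)[#$%^&_{}~]")


def escape_latex(text):
    """Escape reserved characters that are not already preceded by a backslash."""
    return UNESCAPED.sub(lambda m: REPLACEMENTS[m.group(0)], text)


print(escape_latex("50% of $5 & a_b"))
print(escape_latex(r"already \% escaped"))
```

With this sketch, "50% of $5 & a_b" becomes "50\% of \$5 \& a\_b", and text that is already escaped such as "already \% escaped" is returned unchanged. Two limitations to keep in mind: a reserved character that follows an escaped backslash (\\#) is wrongly treated as already escaped, and running the function twice on its own output would escape the braces in \^{} and \~{}.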
