An Introduction to Working with RTL Languages and Bidirectional Text

Let’s say that you’re interested in the Digital Humanities, and you’re working on a project dealing with a textual corpus that’s written right-to-left (RTL), or where you’d like to incorporate some amount of RTL text into a text that’s predominantly written left-to-right (LTR): say, if you’re writing up a blog post in English with some Arabic in it. (The latter is called a bidirectional text.) We will return to some of the computational problems you might encounter in doing this in a later post, but for today let’s focus on the very simplest thing you might want to do: producing a document that someone else might read at some point, which we can quaintly call typesetting.

(Note that these are separate problems from using transliterated, Romanized text: proper transliterations may require a bit more labor to type into a computer, but most transliterated characters are ordinary LTR Unicode characters and render accordingly.)

What tools can you use? What will absolutely not work? And what can work if you’re willing to invest a little bit of time? This is a complicated, somewhat technical subject, and I’m happy to receive any additions or corrections. The following is the result of a subjective (and very likely imperfect) attempt to draw more heavily on Arabic texts in DH and other work, and to quote from and work with Arabic sources in texts that have been predominantly written in English.

Before discussing specific tools, it may be helpful to give a quick overview of the way RTL and bidirectional texts work at a technical level. In the beginning, there was a wild profusion of encodings for different languages, and it is still possible (though much rarer) to run afoul of character encoding problems. (If you’ve ever opened a file and seen a bunch of little squares or question marks, you may have tried to read it in the wrong encoding.) In the past ten years or so, however, one encoding of Unicode (UTF-8) has become the de facto standard for encoding a text: a huge percentage of webpages are encoded in UTF-8, for example, and UTF-8 support is baked into many programming languages in a way that makes working with many of the languages of the world much, much easier if you’re an end user. (If you’re a programmer, Unicode has made your life more complicated in some ways.) And if you’re a fan of emoji, they are widespread and interoperable because the Unicode Consortium specifies and describes their character encodings, and there are a number of interesting technical wrinkles there.
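
To make the “wrong encoding” problem concrete, here is a minimal Python sketch. The sample word and the variable names are my own choices for illustration; the Arabic letters are written as escape sequences so that the code itself stays entirely LTR.

```python
# Four Arabic letters (kaf, teh, alef, beh), written as escapes so that
# this source file contains only LTR characters.
text = "\u0643\u062a\u0627\u0628"

# Encoding turns characters into bytes. In UTF-8 each of these Arabic
# letters takes two bytes, so four characters become eight bytes.
utf8_bytes = text.encode("utf-8")

# Reading the same bytes back with the wrong encoding does not fail;
# it silently produces unrelated characters (so-called mojibake).
garbled = utf8_bytes.decode("latin-1")

# Reading them as ASCII and replacing every undecodable byte gives a run
# of U+FFFD replacement characters, often drawn as question marks.
replaced = utf8_bytes.decode("ascii", errors="replace")

# Reading them back with the right encoding recovers the original text.
recovered = utf8_bytes.decode("utf-8")

print(len(text), len(utf8_bytes))
print(repr(garbled))
print(repr(replaced))
print(recovered == text)
```

Nothing about the bytes changes between the three readings; only the assumption about their encoding does, which is why these problems can be hard to spot after the fact.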

But the varying directionalities of writing systems present a different set of problems. One technical solution for a text in a single language is to specify, for example, that the entire document renders RTL (CBA) even if the file itself is stored LTR (ABC). But if you have a dictionary with entries in a mix of Hebrew and English, for example, you can’t specify a single direction at the outset; there has to be some way of indicating that these characters go RTL while those are LTR. (I don’t know if this helps, but you want the characters in a bidirectional string to be displayed ABC FED GHI and not ABC DEF GHI, where DEF are the characters in an RTL language, stored in the order in which they are read.) There are a couple of different ways to solve this, but the central approach in Unicode is that characters have inherent directionality: Hebrew and Arabic characters are rendered RTL unless they have a good reason not to be. (Characters can have strong, weak, or neutral directionality, and there are invisible control characters that can force directionality. For the full story, both Korpela’s Unicode Explained, pp. 265ff. and the full Unicode bidirectionality spec provide more detail.)
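
You can inspect this inherent directionality yourself. The sketch below uses Python’s standard unicodedata module to look up the bidirectional class of a few characters. The base_direction helper is a hypothetical name of my own and a deliberate simplification: it only looks for the first strongly directional character, and it is not the full Unicode bidirectional algorithm.

```python
import unicodedata

# A few sample characters, written as escapes where they are RTL or invisible.
samples = {
    "Latin letter a": "a",
    "Hebrew letter alef": "\u05d0",
    "Arabic letter beh": "\u0628",
    "digit 1": "1",
    "space": " ",
    "left parenthesis": "(",
    "right-to-left mark (invisible)": "\u200f",
}

for label, ch in samples.items():
    # bidirectional() returns the character's bidi class, e.g. "L" (strong LTR),
    # "R" or "AL" (strong RTL), "EN" (weak), "WS" or "ON" (neutral).
    # mirrored() returns 1 for characters such as parentheses that are
    # drawn mirrored in RTL text, and 0 otherwise.
    print(f"{label}: {unicodedata.bidirectional(ch)} {unicodedata.mirrored(ch)}")


def base_direction(text):
    """Guess a direction from the first strongly directional character.

    Hypothetical helper for illustration only; weak and neutral characters
    are skipped, and embedding or isolate controls are not handled.
    """
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    return "neutral"


print(base_direction("abc \u05d0"))
print(base_direction("123 \u0628 abc"))
```

Digits, spaces, and punctuation have no strong direction of their own, which is one reason that numbers and brackets so often end up in unexpected places in a bidirectional passage.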

If your passion is not the minutiae of Unicode, however, you probably just want to know what works and what doesn’t. The good news is that a lot of things have some measure of bidirectionality support. The bad news is that this support can vary dramatically. Here are some options, and I’ve also indicated the degree of support with emoji. This is a big subject, and I haven’t been able to revisit all of the tools and their possible configurations I’ve listed below, so again, don’t hesitate to get in touch with any advice or corrections.

  • MS Word/Office in Windows 🙂 Support for RTL languages in modern versions of Windows and Office (in Windows) is generally fine. I wouldn’t recommend doing a lot of it, but it’ll work in a pinch. Bidirectionality, on the other hand, can be a real headache, resulting in passages that won’t render correctly but where you can’t see what the problem is. If you need bidirectional support, it can be safer to cut and paste the passages in the opposite direction from another document, at the end of your editing process.
  • MS Office in macOS 😠 For indefensible reasons, RTL and bidirectional support was left out of most versions of Office for Mac until the March 16, 2016 update to Office 2016 for Mac (iOS support came earlier). This means that you cannot reliably share documents with anyone running older versions of Office, that your documents will be garbled if you open them in an old version and accidentally save them, and that exporting Office documents to another format (say, an Excel file to a CSV) will not work in those versions. Feel free to try it now (I haven’t), but be careful.
  • Markdown ❓ Markdown is a way of writing a plain text file so that it can be converted into HTML. Given that a number of forms of Markdown support the use of HTML tags, you can use them to specify the directionality of a document, though bidirectionality appears to remain largely unimplemented. There’s more discussion here. (If you’re interested in a lightweight markup language for Arabic texts, however, Maxim Romanov’s Open Arabic mARkdown project is worth checking out.)
  • Pandoc 😐 Your handy document conversion tool Pandoc now has some support for bidirectional text, which is quite a bit better than it used to be. For years, Pandoc would garble RTL text, but you can now specify the directionality of a document on the command line (with pandoc -V dir=rtl) and at the beginning of a document. As in Markdown, directionality can be specified at the paragraph level by using HTML div elements which specify directionality. To be honest, I still find it a bit fiddly, and I have not had great success consistently converting either RTL or bidirectional texts with Pandoc, but more information about Pandoc RTL and bidirectionality support can be found here (scroll down) and here. (I suspect that there are further refinements to be made: Pandoc does not appear to make allowance for characters with weak directionality, and I have not had enough success with Pandoc to see how well character mirroring works, either.)
  • LaTeX (TeX, ConTeXt, XeLaTeX, etc.) 🙂 Given the fact that TeX and its descendants provide some of the most powerful tools to typeset a document and produce something camera-ready, LaTeX has an enormous amount of potential for typesetting bidirectional documents. However, you will likely find that LaTeX is chiefly used to typeset documents in the sciences, and that documentation addressing the needs of humanities scholars is harder to come by. In working with LaTeX, much depends on hammering out your preamble (including the order of the packages you’re using) and getting the various parts of your LaTeX setup to play nicely with each other. (If this sounds complicated, well… it is.) If you do want to work with multiple, bidirectional languages in LaTeX, I have used polyglossia in the past. (The other major option is babel, which I have not used as extensively.) A clear discussion of multilingual LaTeX typesetting and some sage advice (including on RTL typesetting and working with bidirectional text) can be found in David J. Perry’s “Creating Scholarly Multilingual Documents” (though be advised that some things may have changed since it was written in 2010), and in the polyglossia documentation itself.
  • Text editors 🤔 A full discussion of RTL and bidirectional support in text editors is beyond the scope of this piece, but a brief listing can be found here. RTL languages have decent support in many popular and/or modern text editors (and Linux support for RTL languages is pretty good, actually). On the other hand, it’s important to point out that some of the claims made on the subject (and in that handy Wikipedia table) are overly optimistic. Vim, for example, is claimed to have bidirectional support, while a quick search makes clear that bidi is basically unsupported. More anecdotally, I’ve found Notepad++ to be pretty decent, bidirectional text in Atom can be a bit gnarly (in particular, when you’re dealing with characters that are mirrored or have weak directionality), and support in Sublime Text has been pretty bad (even for RTL documents, which other editors handle fine).
    Finally, though I’ll skip the longer case for GNU Emacs, RTL/bidi in Emacs is less terrible than a lot of the other options (and the manual goes into helpful detail about its bidi implementation). Emacs does not always put point (the cursor) in a visually helpful place in bidi documents, but the fact that you can always navigate the structure of your document with keyboard shortcuts can be very helpful.
  • XML, HTML, etc. 😢 Let’s be blunt: tag-based markup languages for RTL languages make baby Jesus cry. You know the situation is bad when even the W3C working group laconically notes the “lack of good editing environments” for RTL-language HTML authoring. Because the tags of these markup languages are LTR, any RTL text is a bidirectional text, and so you quickly run into all of the complexities of working with a bidirectional text; and it can be very difficult to see, at a glance, if you’ve written correct HTML (/XML/whatever). If you’ve made headway with this, you’re a braver soul than I am.
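
The paragraph-level approach mentioned for Markdown and Pandoc above, HTML div elements that specify directionality, can be sketched in a few lines of Python. This is only an illustration under my own assumptions: directional_div is a hypothetical helper, not part of any library, and the sample sentences are invented. It relies on the standard HTML dir and lang attributes, and it writes the Arabic letters as escape sequences so that the source stays LTR and sidesteps the editing problems described in the list.

```python
from html import escape


def directional_div(text, direction, lang):
    """Wrap text in a div that states its direction and language explicitly.

    Hypothetical helper for illustration; direction must be "ltr" or "rtl".
    """
    if direction not in ("ltr", "rtl"):
        raise ValueError(f"unexpected direction: {direction!r}")
    # escape() protects characters such as < and & that have meaning in HTML.
    return f'<div dir="{direction}" lang="{escape(lang)}">{escape(text)}</div>'


english = "The Arabic word for book is:"
arabic = "\u0643\u062a\u0627\u0628"  # four Arabic letters, stored in reading order

# One div per paragraph, each carrying its own direction.
fragment = "\n".join([
    directional_div(english, "ltr", "en"),
    directional_div(arabic, "rtl", "ar"),
])

print(fragment)
```

The resulting fragment can be pasted into an HTML page or into a Markdown file whose processor passes raw HTML through. How well any given tool then renders or converts it is, as the list suggests, something you will have to test for yourself.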

Even in 2018, then, there are few solutions that allow you to display bidirectional texts with great success and a minimum amount of bother: it’s just one of those things that you have to fiddle with a bit to get to work right. But the situation in 2018 is better than it was five years ago, so we can always hope that support for bidirectional text in the next five years will make this post completely irrelevant. Meliora speramus.