XML Scripture Publishing

From LSDevLinux

Some years ago SIL developed an XML-based Scripture markup standard called XSEM. After getting other agencies and interested parties involved, this grew in scope and became OSIS, a comprehensive, well-engineered and semantically-based way to represent Scripture and related materials using XML.

At the same time, there is an increasing need for translators to be able to carry out draft Scripture publishing in their field locations. High-quality publishing tools are becoming more widely available, often at low cost or even free, but there needs to be a "smooth path" from the data formats used by Scripture editing software to the formats used by publishing tools. In addition, end users (translators, literacy workers and Scripture-in-use staff) need to be able to customize the format and style of the printed result without needing extensive knowledge of publishing terminology and techniques.

Current Status

Most Scripture editing and publishing is currently done using USFM. Although OSIS is regarded as a better solution in many ways, there is little support for it as yet. Users cite the lack of available software tools, and software development managers point to the low level of adoption of OSIS as a reason for postponing the development of tools to support it.

An XML-based publishing path offers great potential for increased automation and greater sophistication of layout, so we'd like to break this cycle by kick-starting development of a proof-of-concept publishing "pipeline". This would demonstrate what's possible with externally-provided tools that are already available for handling XML, and also explore what kind of benefits result from using an all-XML approach.

Available Tools

There are a number of established technologies and tools for getting from semantically-based XML to print.


  • XSLT is used to transform one form of XML into another, typically a semantically-oriented markup into a presentational one, ready for publishing on-screen or in print.
  • XHTML is HTML that adheres to XML markup rules. HTML is one of the most versatile and widely recognized markup systems for on-screen viewing. Its great advantage over PDF is that it is not tied to a specific paper page size: the text reflows, so the user can view it as large or small as they like without needing to scroll horizontally. Although not directly relevant for print output, XHTML is often used for on-screen delivery of Scripture, so it may need to be offered as an alternative output format from the same publication process that is used for print.
  • CSS is a language for specifying the styling of text and paragraphs. Although designed for use on the web, it is sophisticated enough for many print purposes too. Combined with (X)HTML, CSS enables much of the low-level styling to be taken out of the markup and placed in a separate style sheet.
  • XSL-FO is a markup system that is somewhat like XHTML and uses CSS-like properties for styling, but is oriented towards page-formatted, print-style output. Documents that are transformed ("styled") to XSL-FO can then be given to an "FO processor" to produce print-quality output, typically PDF.
  • PDF is effectively "electronic paper". Unlike XHTML+CSS, it is page-oriented, natively supports vector graphics, and a document can include a copy of the fonts that it uses so that the document will look identical everywhere, both on-screen and in print. Pages can be a disadvantage for on-screen viewing, especially when zooming in to increase the text size for easier reading, but PDF is an ideal medium when preparing documents for print since it allows accurate on-screen proofing and is completely separate from communication with the actual output device.
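
To make the relationship between these formats more concrete, here is a minimal sketch in Python. It uses only the standard library's xml.etree.ElementTree in place of a real XSLT processor, so it illustrates the kind of mapping an XSLT style sheet would perform rather than XSLT itself. The source fragment is only loosely modelled on OSIS (it is not schema-valid and contains placeholder text), and the function names, class names, the file name scripture.css and the page dimensions are all assumptions made for illustration. The same semantic source is turned into XHTML (styled by a separate CSS style sheet) and into a bare-bones XSL formatting objects (FO) document.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
FO_NS = "http://www.w3.org/1999/XSL/Format"
ET.register_namespace("fo", FO_NS)

# Hypothetical source, loosely modelled on OSIS; placeholder text only.
SOURCE = """<chapter osisID="John.3">
  <title>Chapter heading</title>
  <verse osisID="John.3.16">Text of verse sixteen.</verse>
  <verse osisID="John.3.17">Text of verse seventeen.</verse>
</chapter>"""

# Styling for the XHTML output lives in its own style sheet (assumed name:
# scripture.css), not in the markup.
CSS = """h2.chapter-title { font-size: 1.4em; font-weight: bold; }
p.scripture { text-align: justify; }
span.verse-number { vertical-align: super; font-size: 0.7em; }
"""


def read_chapter(source_xml):
    """Extract the semantic content: chapter ID, title, (number, text) pairs."""
    chapter = ET.fromstring(source_xml)
    title = chapter.findtext("title", default="")
    # The verse number is the last component of an ID such as "John.3.16".
    verses = [(v.get("osisID").rsplit(".", 1)[-1], v.text or "")
              for v in chapter.findall("verse")]
    return chapter.get("osisID"), title, verses


def to_xhtml(source_xml):
    """Build presentational XHTML; all styling is deferred to CSS classes."""
    osis_id, title, verses = read_chapter(source_xml)
    html = ET.Element("html", {"xmlns": XHTML_NS})
    head = ET.SubElement(html, "head")
    ET.SubElement(head, "title").text = osis_id
    ET.SubElement(head, "link", {"rel": "stylesheet", "type": "text/css",
                                 "href": "scripture.css"})
    body = ET.SubElement(html, "body")
    ET.SubElement(body, "h2", {"class": "chapter-title"}).text = title
    para = ET.SubElement(body, "p", {"class": "scripture"})
    for number, text in verses:
        span = ET.SubElement(para, "span", {"class": "verse-number"})
        span.text = number
        span.tail = text + " "  # verse text follows its number
    return ET.tostring(html, encoding="unicode")


def to_xsl_fo(source_xml):
    """Build a minimal FO document; styling is carried as FO properties."""
    _, title, verses = read_chapter(source_xml)

    def fo(name):
        return "{%s}%s" % (FO_NS, name)

    root = ET.Element(fo("root"))
    masters = ET.SubElement(root, fo("layout-master-set"))
    # Page geometry is part of the document itself (assumed A5 here).
    page = ET.SubElement(masters, fo("simple-page-master"),
                         {"master-name": "page", "page-width": "148mm",
                          "page-height": "210mm"})
    ET.SubElement(page, fo("region-body"), {"margin": "15mm"})
    sequence = ET.SubElement(root, fo("page-sequence"),
                             {"master-reference": "page"})
    flow = ET.SubElement(sequence, fo("flow"),
                         {"flow-name": "xsl-region-body"})
    heading = ET.SubElement(flow, fo("block"),
                            {"font-size": "14pt", "font-weight": "bold"})
    heading.text = title
    para = ET.SubElement(flow, fo("block"), {"text-align": "justify"})
    for number, text in verses:
        inline = ET.SubElement(para, fo("inline"),
                               {"baseline-shift": "super", "font-size": "70%"})
        inline.text = number
        inline.tail = text + " "
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(to_xhtml(SOURCE))
    print(to_xsl_fo(SOURCE))
```

In this sketch the XHTML result carries no page geometry and relies on the style sheet for its appearance, whereas the FO result fixes the page size and margins and could, in principle, be handed to an FO processor to produce PDF. A real pipeline would more likely express both mappings as XSLT style sheets applied to full OSIS documents.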