Bible Taxonomy – Start Off Right (Minimized XML)

Note: This post is part of a series explaining how I created the new Bible Taxonomy tool as seen on DiscipleShare. To see it in action, or to find great, free curriculum to use in churches, visit: http://www.discipleshare.net/

To start, I looked for Bible data that already existed on the internet. I looked at the APIs of popular online Bible services, but didn’t find any that offered the backend database support I’d need for relational tables in MySQL.

I googled “NRSV xml” and found some good stuff, including a file that’s no doubt a copyright violation and might be taken down at any time.

(If I were starting over, I’d use the SBL GNT or a KJV version that’s now public domain.)

The first bottleneck I stumbled on when building the plugin was the Bible structure. How could I get an XML file of more than 5000 KB to the user before they gave up and left the site?

Answer 1: Strip the verses out.

The second bottleneck: once I’d stripped the verses (the bulk of the data) out of the file, was there a way to compress it even further?

Answer 2: The only data needed is book names + number of chapters + number of verses per chapter. At that point, the XML looks more like a nested array of strings and integers than a dictionary or catalog of multi-layered data objects.

I was enough of a newbie at data objects in PHP that I chose to use C# and the Visual Studio IDE instead, so I could debug and troubleshoot more quickly.

Here’s the resulting file (I just kept it all in the solution’s default file, Program.cs):

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Xml;
using System.Xml.Linq;

namespace BibleXmlStructurizer
{
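    class Program
    {
        // Running counter that gives every chapter and verse a unique id
        public static int elementid;

        static void Main(string[] args)
        {
            elementid = 0;

            // Path to the full source Bible XML (left blank here; point it at your own file)
            string filepath = "";
            XDocument doc = XDocument.Load(filepath);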

            // load Bible
            XElement bible = doc.Descendants("bible").FirstOrDefault();
            
            // oldtestament drives the OT/NT "section" tag that helps the JS
            // parse the books into separate visual elements
            bool oldtestament = true;

            // A List keeps the books in document order;
            // Dictionary enumeration order is not guaranteed
            List<string> bookXmlList = new List<string>();

            foreach (var book in bible.Elements("book"))
            {
                // Get name from the attribute value
                var name = book.Attribute("name").Value;

                // calls method to get XML data
                bookXmlList.Add(GetBookXML(book, oldtestament));

                // Malachi is the last Old Testament book. The flag is flipped here
                // because a bool passed to a method is only a copy.
                if (name == "Malachi")
                {
                    oldtestament = false;
                }
            }

            // Output
            StringBuilder sb = new StringBuilder();
            sb.Append("<bible>");
            foreach (var bookXml in bookXmlList)
            {
                sb.Append(bookXml);
            }
            sb.Append("</bible>");

            // Path for the minimized output file (left blank here; point it at your own file)
            string outputfilepath = "";
            byte[] txt = Encoding.ASCII.GetBytes(sb.ToString());
            File.WriteAllBytes(outputfilepath, txt);
            
        }
        
        static string GetBookXML(XElement book, bool oldtestament)
        {
            var sb = new StringBuilder();
            XmlTextWriter output = new XmlTextWriter(new StringWriter(sb));
            output.WriteStartElement("book"); // <book>

            // XAttribute.Value returns the attribute text directly,
            // so there is no need to split the raw name="..." string on quotes
            output.WriteAttributeString("name", book.Attribute("name").Value);

            if (oldtestament)
                output.WriteAttributeString("section", "OT");
            else
                output.WriteAttributeString("section", "NT");

            int chaptercount = book.Elements("chapter").Count();
            output.WriteAttributeString("chaptercount", chaptercount.ToString());

            foreach (var chapter in book.Elements("chapter"))
            {
                int versecount = 0;

                output.WriteStartElement("chapter"); // <chapter>
                output.WriteAttributeString("name", chapter.Attribute("name").Value);
                output.WriteAttributeString("id", elementid.ToString());
                elementid++; // the next element gets a fresh id

                foreach (var verse in chapter.Elements("verse"))
                {
                    versecount++;

                    output.WriteStartElement("verse"); // <verse>
                    output.WriteAttributeString("name", versecount.ToString());
                    output.WriteAttributeString("id", elementid.ToString());
                    elementid++;
                    // The verse text is deliberately not written (Answer 1):
                    // only the structure is kept
                    output.WriteEndElement(); // </verse>
                }

                output.WriteEndElement(); // </chapter>
            }

            // The switch from OT to NT after Malachi has to be made by the caller:
            // oldtestament is passed by value, so assigning to it here would be lost.

            output.WriteEndElement(); // </book>
            output.Close();

            return sb.ToString();
        }
    }
}

The third bottleneck: for the database structure, I needed easier references to each bible element (so that I didn’t have to keep parsing out on every GET or PUT to figure out what it referenced).

Answer 3: Add a static counter to the class:

public static int elementid;

then, for every element, write it out as an attribute:

output.WriteAttributeString("id", elementid.ToString());

and increment the counter each time you’ve written an id to the XML:

elementid++;
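To illustrate what the ids buy you, here is a minimal sketch of turning an id back into a readable reference with LINQ to XML. It assumes the output has the book > chapter > verse shape described in this post, with an id attribute on every chapter and verse. The ReferenceLookup class and its Describe method are names made up for this example; this is not the plugin’s actual lookup code.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper for illustration; not part of the original Program.cs
static class ReferenceLookup
{
    // Returns "Book Chapter:Verse" for a verse id, "Book Chapter" for a
    // chapter id, or null when no element carries that id.
    public static string Describe(XDocument bible, int id)
    {
        string wanted = id.ToString();

        // The (string) cast yields null for elements without an id attribute
        XElement match = bible.Descendants()
            .FirstOrDefault(e => (string)e.Attribute("id") == wanted);
        if (match == null)
        {
            return null;
        }

        if (match.Name == "verse")
        {
            XElement chapter = match.Parent;
            XElement book = chapter.Parent;
            return book.Attribute("name").Value + " "
                + chapter.Attribute("name").Value + ":"
                + match.Attribute("name").Value;
        }

        // Otherwise the id belongs to a chapter, whose parent is the book
        return match.Parent.Attribute("name").Value + " "
            + match.Attribute("name").Value;
    }
}
```

Calling ReferenceLookup.Describe(XDocument.Load(path), 2) would scan the whole document once per call. For repeated lookups, building a Dictionary<int, XElement> up front would probably be the better choice.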

While Program.cs isn’t the most elegant code, it got the job done and gave me the lightweight XML file I needed.

Full Bible: >5000 KB
Bible_min_with_IDs: 983 KB
Bible_min: 49 KB

49 KB is small enough to store in a user’s browser cache, so they get a good experience every time. I’m also not too worried about Bible_min_with_IDs, because its 983 KB will be loaded locally by the PHP code, which is not too big of an issue.
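The listing in this post does not show how the 49 KB Bible_min variant was generated. As a rough sketch of Answer 2, the following reduces each book to its name, its chapter count, and one verse count per chapter. The MinimalBible class, the Build method, and the versecount attribute name are all assumptions made for this example, and the OT/NT section tagging is left out for brevity.

```csharp
using System.Linq;
using System.Xml.Linq;

// Hypothetical sketch; the original post does not show this code
static class MinimalBible
{
    // Takes <bible><book name><chapter name><verse>...</verse></chapter></book></bible>
    // and returns the same tree with the verses replaced by a count.
    public static XElement Build(XElement bible)
    {
        var result = new XElement("bible");

        foreach (XElement book in bible.Elements("book"))
        {
            var minBook = new XElement("book",
                new XAttribute("name", book.Attribute("name").Value),
                new XAttribute("chaptercount", book.Elements("chapter").Count()));

            foreach (XElement chapter in book.Elements("chapter"))
            {
                minBook.Add(new XElement("chapter",
                    new XAttribute("name", chapter.Attribute("name").Value),
                    // "versecount" is an assumed attribute name
                    new XAttribute("versecount", chapter.Elements("verse").Count())));
            }

            result.Add(minBook);
        }

        return result;
    }
}
```

Something like MinimalBible.Build(doc.Root).Save(path, SaveOptions.DisableFormatting) would then write the result without indentation, which keeps the file as small as possible.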

Next up: JavaScript + CSS to make this look pretty.
