developerWorks
http://www.ibm.com/developerworks/opensource/library/x-po...
Colin Beckingham is a freelance researcher, writer, and programmer who lives in eastern Ontario, Canada. Holding degrees from Queen's
University, Kingston, and the University of Windsor, he has worked in a rich variety of fields including banking, horticulture, horse racing,
teaching, civil service, retail, and travel and tourism. The author of database applications and numerous newspaper, magazine, and online
articles, his research interests include open source programming, VoIP, and voice-control applications on Linux. You can reach Colin at
colbec@start.ca.
30 August 2011
For the former problem, where the EPUB is broken internally, you
can use the EpubCheck project (see Resources for a link). The
get
The scanner has read to the end of a page, inserted a closing paragraph tag (whether or not the paragraph
actually ends there) so that the page is syntactically complete, and then started the next page with a
new paragraph, again whether or not that is appropriate. The result is complete code but incomplete
paragraphs, because sections are orphaned. On the e-reader, the user might see both sections on the same
device page, with no page marker displayed, but separated as if they were independent paragraphs.
Similarly, consider blank pages:
<div class="newpage" id="page-128"></div>
<p></p>
<div class="newpage" id="page-129"></div>
Does page 129 in the snippet above really exist? It might be important to preserve it blank, but otherwise,
it is inconvenient to have to turn two pages when only one should be necessary.
Spelling errors are a different kind of problem: instead of looking for complex patterns, you compare two
lists of words. You deal with this problem separately, using scripting methods.
Sigil
Sigil (see Resources for the website and support pages) is a WYSIWYG EPUB editor that can find the
pattern-matching types of errors and allow programmers to correct them. See the Regular expressions
sidebar for a quick introduction to regular expressions, and see Resources for more detailed information.
Sigil might not be available from your Linux repository, but it is [...]
[Sidebar: Regular expressions]
[...] GUI, click File > Open to open your EPUB directly. Doing so extracts the EPUB and displays a
directory of the component files on the left; it reveals a browser pane on the right in which you can
display the contents of individual files either as you might view them [...]
Choose one of the HTML files that your EPUB contains, and [...] Code View to display the code behind the
file. All the tags should now be visible.
Suppose that you want to find orphaned paragraph chunks. The
criterion you are looking for is </p> end-of-paragraph tags that are
not preceded by a normal end-of-sentence character. The most
common of these characters is the period. Sigil provides a search
function (Edit > Find), and the normal search mode lets you find
strings like .</p>, but it does not help you find the end of paragraph
that does not have a period before it. For this, you need the regular
expression search mode, which appears when you click More. Navigate to the top of the code in the
browser window, then perform these steps:
1. Select Down for the direction.
2. Select Regular expression for the search mode.
3. Type [^.]</p> as your Find what string.
4. Click Find Next.
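The same search can be tried outside Sigil. The following is a minimal sketch, not part of the original article, that applies the [^.]</p> pattern line by line with PHP's preg_match. The sample markup and the variable names are hypothetical; in practice the text would come from one of the XHTML files extracted from the EPUB.

```php
<?php
// Hypothetical sample markup standing in for one extracted XHTML file.
$html = <<<HTML
<p>This paragraph ends properly.</p>
<p>This one stops at the foot of a page</p>
<p>and carries on here.</p>
HTML;

// Same pattern as the Sigil search: any character other than a period,
// followed by the closing paragraph tag. '#' is used as the delimiter
// so that the slash in </p> needs no escaping.
$pattern = '#[^.]</p>#';

$hits = [];
foreach (explode("\n", $html) as $n => $line) {
    if (preg_match($pattern, $line)) {
        $hits[] = $n + 1; // 1-based line number of the suspect line
        echo "Line " . ($n + 1) . ": $line\n";
    }
}
```

Only the second line of the sample is reported, because it is the only one whose closing tag is not preceded by a period.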
This process should find what you are seeking, if it exists. If there are no hits, you might want to create
one temporarily just to check that the search function works.
After using this technique for a while, you soon find that paragraphs can legitimately end with characters
other than periods. Double quotation marks ("), exclamation marks (!), question marks (?), and maybe some
other characters also fit the requirement of a complete sentence. Allowing for this is not a problem with
regular expressions. Because the square brackets define a character class (and the leading caret negates
it), if you change the Find what to [^.?!"]</p>, the search accepts as normal anything that has a period,
question mark, exclamation mark, or double quotation mark at the end of a paragraph and flags anything
else as a possible error.
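As a rough illustration of the widened character class, the sketch below builds the pattern from a configurable list of accepted terminators. The function name suspect_endings and the sample text are hypothetical; the default terminator list is the one discussed above and would need adjusting for a particular book.

```php
<?php
// Return the paragraphs whose closing tag is not preceded by one of the
// accepted terminator characters. Hypothetical helper for illustration.
function suspect_endings(string $html, string $terminators = '.?!"'): array
{
    // preg_quote escapes characters that are special in a pattern
    $class = preg_quote($terminators, '#');
    $pattern = '#[^' . $class . ']</p>#';

    // Pull out each <p>...</p> chunk, then test its ending
    preg_match_all('#<p>.*?</p>#s', $html, $m);
    return array_values(array_filter(
        $m[0],
        fn($p) => preg_match($pattern, $p) === 1
    ));
}

$html = '<p>"Is that so?"</p><p>Stop!</p>'
      . '<p>It was the end of the</p><p>line for him.</p>';

$suspects = suspect_endings($html);
print_r($suspects);
```

Passing a longer terminator string, for example one that also contains a colon, relaxes the check without touching the rest of the code.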
Another telltale sign of a broken paragraph is a <p> tag followed by a lowercase alphabetic character. The
regular expression for this is <p>[a-z]. (the final period matches any single character). Another useful
one is <p>[0-9]., which finds paragraphs that begin with a number. Such a hit can appear where the scanner
has picked up a printed page number that is no longer relevant in an e-reader context.
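The two paragraph-opening checks can be combined in a few lines. This is only a sketch under the assumption that each paragraph is tested on its own; classify_opening and the sample strings are hypothetical names introduced here.

```php
<?php
// Classify how a paragraph opens, using the two patterns from the text.
function classify_opening(string $para): string
{
    if (preg_match('#<p>[a-z].#', $para)) {
        return 'lowercase'; // probably the tail of a split paragraph
    }
    if (preg_match('#<p>[0-9].#', $para)) {
        return 'digit';     // possibly a stray printed page number
    }
    return 'ok';
}

$samples = [
    '<p>ical page and then continues.</p>',
    '<p>129</p>',
    '<p>A proper start.</p>',
];

$results = array_map('classify_opening', $samples);
foreach ($samples as $i => $p) {
    echo $results[$i] . ": $p\n";
}
```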
How you decide to fix one of these errors is another matter. If a page marker separates the two pieces,
you might move the marker to before or after the true paragraph and rejoin the two pieces to make one
single paragraph. The page numbering is then approximately but not perfectly accurate.
Searching for page markers is a similar process. With the regular expression option selected and a Find
what of page-[0-9]+, the editor searches for any string that begins with the literal characters p, a, g, e,
and a dash, followed by one or more digits in the range zero to nine.
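In a script, the same pattern with a capture group added returns the page numbers themselves. A minimal sketch, using the blank-page snippet shown earlier as hypothetical input:

```php
<?php
$html = '<div class="newpage" id="page-128"></div>'
      . '<p></p>'
      . '<div class="newpage" id="page-129"></div>';

// The parentheses capture just the digits that follow "page-"
preg_match_all('/page-([0-9]+)/', $html, $matches);

// $matches[1] holds the captured digit strings; convert them to integers
$pages = array_map('intval', $matches[1]);
print_r($pages);
```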
An interesting break that you can find easily is one where a word, paragraph, and page are all broken at
the same time. The print version indicates the break with a hyphen or dash, which is easily visible and
searched for in code view:
<p>This is where my paragraph begins, hits the end of a phys-</p>
<div class="newpage" id="page-12"></div>
<p>ical page and then continues from the top of the next physical page,
finally coming to an end here.</p>
In this case, a global normal search using the Find what string of -</p> should pick them out quite
quickly.
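Once found, such a break could in principle be repaired by a script. The sketch below is one possible approach, not the article's method: it moves the page marker in front of the paragraph and rejoins the two halves of the word. The function name is hypothetical, and the pattern assumes the marker markup looks exactly like the example above. Note that it would also remove the hyphen from a genuinely hyphenated compound that happens to fall at a page break, so the output still needs a human check.

```php
<?php
// Rejoin a word split by a hyphen across a page marker (illustrative only).
function rejoin_hyphen_breaks(string $html): string
{
    // $1 = paragraph text before the hyphen (never crossing a </p>),
    // $2 = the page marker div
    $pattern = '#<p>((?:(?!</p>).)*)-</p>\s*'
             . '(<div class="newpage" id="page-[0-9]+"></div>)\s*<p>#s';

    // Put the marker first, then one paragraph with the word rejoined
    return preg_replace($pattern, '$2' . "\n" . '<p>$1', $html);
}

$html = "<p>This is where my paragraph begins, hits the end of a phys-</p>\n"
      . "<div class=\"newpage\" id=\"page-12\"></div>\n"
      . "<p>ical page and then continues.</p>";

$fixed = rejoin_hyphen_breaks($html);
echo $fixed . "\n";
```

As noted earlier, moving the marker this way leaves the page numbering approximately but not perfectly accurate.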
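<?php
// NOTE: the opening lines of this listing were lost; the three
// assignments below are reconstructed from the description that
// follows and from Listing 2, and are an assumption.
$firstpage = 0;  // number of the first logical page found
$oldpage = 0;    // number of the previous page checked
$opf_file = "./OEBPS/content.opf";
if (!file_exists($opf_file)) {
    //cleanup();
    die("Cannot find the OPF file\n");
} else {
    echo "Found it!\n";
    $xml = simplexml_load_file($opf_file);
    // get the manifest items
    foreach ($xml->manifest->item as $mi) {
        if ($mi['media-type']=='application/xhtml+xml') {
            echo "Found ".$mi['href']."\n";
            // only files whose names start with "part" hold extended text
            if (substr($mi['href'],0,4) == 'part') {
                echo "Page number check in document ".$mi['href']."\n";
                echo scan_chap("./OEBPS/".$mi['href']);
            }
        }
    }
}
function scan_chap($chap) {
    global $firstpage, $oldpage;
    echo "Trying to page num check section $chap \n";
    if (!file_exists($chap)) {
        echo "Cannot find the chapter $chap\n";
    } else {
        echo "Found it!\n";
        $xml = simplexml_load_file($chap);
        //$i = 0;
        foreach ($xml->body->div->div as $pagnumdiv) {
            if ($pagnumdiv["class"]=='newpage') {
                echo $pagnumdiv["id"]."\n";
                // id looks like "page-128"; skip the 5-character prefix
                $page = (int) substr($pagnumdiv["id"],5);
                if ($firstpage == 0) {
                    $firstpage = $oldpage = $page;
                } else {
                    if ($page != $oldpage+1) echo "Problem at page after $oldpage\n";
                    // resynchronise on the page just read (was $oldpage++) so
                    // that one gap is reported once, not on every later page
                    $oldpage = $page;
                }
            }
        }
    }
    return "Done...\n";
}
?>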
The code first sets up global variables for the number of the first logical page found (set once at the
beginning of the loop) and the number of the previous page checked (that changes with each iteration). It
then declares the name of the OPF file, looks for that file, and, if it cannot find it, ends with an error.
If the file is found, the script opens the file as an XML object and looks for the names of the files
mentioned in the manifest that appear to be HTML using the media-type attribute. In this particular EPUB
document, some HTML files contain only a full-page image and therefore can be ignored. The file names
of these pages contain the string leaf; the files that contain extended text have a part label. The
code uses a substring test to keep only the part files.
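To experiment with the manifest filter without unpacking an EPUB, the same logic can be run against an in-memory string. This is a simplified sketch: the function name text_chapters, the sample manifest, and its file names are hypothetical, and the sample omits details of a real OPF file such as the XML namespace declaration and metadata.

```php
<?php
// Return the hrefs of manifest items that are XHTML and whose file name
// starts with the given prefix (illustrative helper).
function text_chapters(string $opf, string $prefix = 'part'): array
{
    $xml = simplexml_load_string($opf);
    $hrefs = [];
    foreach ($xml->manifest->item as $item) {
        $href = (string) $item['href'];
        $type = (string) $item['media-type'];
        if ($type === 'application/xhtml+xml'
            && strncmp($href, $prefix, strlen($prefix)) === 0) {
            $hrefs[] = $href;
        }
    }
    return $hrefs;
}

// Hypothetical, stripped-down manifest
$opf = <<<XML
<package>
  <manifest>
    <item id="c1" href="part0001.html" media-type="application/xhtml+xml"/>
    <item id="i1" href="leaf0002.html" media-type="application/xhtml+xml"/>
    <item id="css" href="style.css" media-type="text/css"/>
    <item id="c2" href="part0003.html" media-type="application/xhtml+xml"/>
  </manifest>
</package>
XML;

$chapters = text_chapters($opf);
print_r($chapters);
```

Casting each attribute to a string before comparing avoids relying on implicit conversion of the SimpleXML attribute objects.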
Now that you know the name of the file, you can read this file into its own SimpleXML object. Iterating
through the <div> tags and filtering for those that have a class attribute of newpage, you can find the
value of the id attribute that contains the page number. You need to let the book tell you which number is
the first page because this is often not page 1, and after this value is stored in the global $firstpage
variable, you can go on to predict what the number of the next page should be. If it happens not to be the
expected number, the script reports a problem and continues checking.
This script does not attempt to make changes to the text. It merely flags what it thinks might need your
attention.
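The sequence check itself can be exercised on a small in-memory document. The following sketch assumes the same nesting as the listing (page markers as <div> children of a single <div> inside <body>); the function name page_gaps and the sample markup are hypothetical, and the function returns its findings rather than relying on global variables.

```php
<?php
// Return the page numbers after which the expected next page is missing.
function page_gaps(string $xhtml): array
{
    $xml = simplexml_load_string($xhtml);
    $problems = [];
    $oldpage = null; // no page seen yet; the book supplies the first number
    foreach ($xml->body->div->div as $div) {
        if ((string) $div['class'] !== 'newpage') {
            continue;
        }
        // id looks like "page-128"; skip the 5-character prefix
        $page = (int) substr((string) $div['id'], 5);
        if ($oldpage !== null && $page !== $oldpage + 1) {
            $problems[] = $oldpage;
        }
        $oldpage = $page;
    }
    return $problems;
}

// Hypothetical chapter in which page 12 is missing
$sample = <<<XML
<html><body><div>
<div class="newpage" id="page-10"></div><p>Text.</p>
<div class="newpage" id="page-11"></div><p>Text.</p>
<div class="newpage" id="page-13"></div><p>Text.</p>
<div class="newpage" id="page-14"></div><p>Text.</p>
</div></body></html>
XML;

$gaps = page_gaps($sample);
foreach ($gaps as $after) {
    echo "Problem at page after $after\n";
}
```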
using SimpleXML but also for using the Enchant spelling manager library. Enchant is capable of
managing multiple different base spelling lists. It helps to differentiate UK English from US English
spellings, for example.
The script in Listing 2 iterates through the HTML component files listed in the manifest using the same
method as in Listing 1, this time going through each file paragraph by paragraph and word by word and
checking each word against the known spelling list. It also adds the instructions required to access the
dictionaries.
Listing 2. Spell checking the EPUB with PHP, SimpleXML, and Enchant
<?php
// spell check an epub
/* epub is a zipped package containing many files
   the file "content.opf" contains the pointers to the constituent files
   inside content.opf we have
     package (root)
       -> manifest
         -> item
   which we need to filter for media-type="application/xhtml+xml"
   and to check these are real text pages, not just full page images
   these are the text chapters that need to be checked one by one
   Acknowledgment: Some of the dictionary-related code
   was copied from the PHP Enchant manual page
*/
// set up console for input
$console = fopen("php://stdin","r");
// set up enchant (from PHP manual)
$tag = 'en_CA';
$r = enchant_broker_init();
$bprovides = enchant_broker_describe($r);
echo "Current broker provides the following backend(s):\n";
print_r($bprovides);
$dicts = enchant_broker_list_dicts($r);
print_r($dicts);
if (enchant_broker_dict_exists($r,$tag)) {
    $d = enchant_broker_request_dict($r, $tag);
    $dprovides = enchant_dict_describe($d);
    echo "dictionary $tag provides:\n";
    print_r($dprovides);
} else {
    cleanup();
    die ("Cannot set up the spell checker\n");
}
// look for the text to be checked
$opf_file = "./OEBPS/content.opf";
if (!file_exists($opf_file)) {
    cleanup();
    die("Cannot find the OPF file\n");
} else {
    echo "Found it!\n";
    $xml = simplexml_load_file($opf_file);
    foreach ($xml->manifest->item as $mi) {
        if ($mi['media-type']=='application/xhtml+xml') {
            echo "Found ".$mi['href']."\n";
            // only files whose names start with "part" hold extended text
            if (substr($mi['href'],0,4) == 'part') {
                echo "Need to spell check ".$mi['href']."\n";
                echo scan_chap("./OEBPS/".$mi['href']);
            }
        }
    }
}
function cleanup() {
    global $d, $r;
    // $d is not set if the dictionary request never succeeded
    if (isset($d)) enchant_broker_free_dict($d);
    enchant_broker_free($r);
}
function scan_chap($chap) {
    echo "Trying to spell check section $chap \n";
    if (!file_exists($chap)) {
        echo "Cannot find the chapter $chap\n";
    } else {
        echo "Found it!\n";
        $xml = simplexml_load_file($chap);
        $i = 0;
        foreach ($xml->body->div->p as $para) {
            echo $para."\n";
            // need to spell check the contents of $para
            spell_check(trim($para));
            $i++;
            // stop after the first few paragraphs of each file
            if ($i > 5) break;
        }
    }
    return "Done...\n";
}
function spell_check($para) {
    global $console, $d;
    // collapse double spaces into single spaces
    $para = str_replace("  "," ",$para);
    $para = str_replace(".","",$para);
    $para = $para." ";
    echo "Checking text : $para\n";
    $start = 0;
    $pos = 0; // initialise so the loop condition is defined on the first pass
    while ($pos !== false) {
        $pos = strpos($para," ",$start);
        echo "Found $pos\n";
        if (!$pos) break;
        $len = $pos-$start;
        $theword = substr($para,$start,$len);
        // tidy up theword which may contain punctuation
        $punc = array(':',';',',','"','?','!');
        $theword = str_replace($punc,"",$theword);
        //
        if ((strlen($theword) > 0) and (!is_numeric($theword))) {
            if ($wordcorrect = enchant_dict_check($d, $theword)) {
                echo "$theword is OK!\n";
            } else {
                $suggs = enchant_dict_suggest($d, $theword);
                echo "Suggestions for <$theword>:\n";
                //print_r($suggs);
                $max = 5;
                foreach ($suggs as $k=>$sugg) {
                    // show only the first $max suggestions
                    if ($k >= $max) break;
                    echo "$k => $sugg\n";
                }
                // wait for the user before moving on to the next word
                $inp = fgets($console,1024);
            }
        }
        $start += $len+1;
    }
}
?>
In this code, you start by declaring a file pointer to standard input so that you can get interactive
information from the keyboard during the spell-check process. The next section establishes the
connection to the dictionaries. Note that the tag variable is set to en_CA, which, in this instance, puts a
preference on Canadian English. The result is that the checker chooses colour over color,
acknowledgement over acknowledgment, and so on. A more standard setting for the tag is en_US. After
the dictionary is connected, the script performs the same search for HTML text files as in Listing 1, but
this time, instead of looking for page number <div> tags, it looks for paragraphs with real text.
Before performing the actual spell check, the script cleans up the paragraph text to make it more
manageable by removing long spaces and removing periods and commas because the goal is to
examine word by word. Then, the actual spell checking starts by moving from word to word in the
paragraph, ignoring words that are numbers and comparing the word to the dictionary. Where the
dictionary does not contain the word, the script suggests words that might be a better substitute. In this
case, the script presents only the first five alternates. The script halts at each problem word and waits for
user input from the keyboard. At this point, you can add code to change, ignore once, ignore for the
session, and so on.
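The word-by-word walk can also be written more compactly with preg_split. The sketch below is an alternative to the strpos loop in Listing 2, not the article's code. So that it runs without the Enchant extension, the dictionary lookup is passed in as a callable; the function name misspelled_words, the tiny word list, and the sample sentence are all hypothetical. With Enchant available, the callable could simply wrap enchant_dict_check($d, $word).

```php
<?php
// Return the words in a paragraph that the supplied lookup does not know.
function misspelled_words(string $para, callable $isKnown): array
{
    // Split on runs of whitespace; PREG_SPLIT_NO_EMPTY drops empty tokens
    $tokens = preg_split('/\s+/', $para, -1, PREG_SPLIT_NO_EMPTY);
    $bad = [];
    foreach ($tokens as $token) {
        // Strip the same punctuation that Listing 2 removes
        $word = str_replace(['.', ':', ';', ',', '"', '?', '!'], '', $token);
        if ($word === '' || is_numeric($word)) {
            continue; // nothing left, or a number: not checked
        }
        if (!$isKnown($word)) {
            $bad[] = $word;
        }
    }
    return $bad;
}

// Stand-in for a real dictionary (assumption for this example only)
$known = ['the', 'colour', 'of', 'horses', 'in'];
$isKnown = fn(string $w): bool => in_array(strtolower($w), $known, true);

$bad = misspelled_words('The colur of 12 horses, in  1911.', $isKnown);
print_r($bad);
```

Returning the list of problem words, rather than prompting inside the loop, leaves room to add the change, ignore once, and ignore for the session choices in a separate step.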
Conclusion
Both Sigil and PHP scripting with XML and spelling libraries are helpful tools in finding and fixing errors
that cannot be detected using normal EPUB checking routines. Whether these secondary errors are truly
errors or just minor cosmetic inconveniences depends on the context in which you are using the
document and the ability of the hardware reader and its own software to resolve these issues on the fly.
Resources
Learn
Build a digital book with EPUB (Liza Daly, developerWorks, updated January
2011, published November 2008): Read an introduction to EPUB and a list of
EPUB resources.
Know your regular expressions (Michael Stutz, developerWorks, June 2007):
Check out this introduction to regular expressions on UNIX systems.
Discover the available tools and techniques that can help you learn how to
construct regular expressions for various programs and languages.
More articles by this author (Colin Beckingham, developerWorks, March
2009-current): Read articles about XML, voice recognition, XHTML, PHP,
SMIL, and other technologies.
New to XML? Get the resources you need to learn XML.
XML area on developerWorks: Find the resources you need to advance your
skills in the XML arena, including DTDs, schemas, and XSLT. See the XML
technical library for a wide range of technical articles and tips, tutorials,
standards, and IBM Redbooks.
IBM XML certification: Find out how you can become an IBM-Certified