uk.ac.kcl.cch.jb.pliny.imageRes.dnd
Class HtmlConverter

java.lang.Object
  extended by uk.ac.kcl.cch.jb.pliny.imageRes.dnd.HtmlConverter

public class HtmlConverter
extends Object

a utility class that processes the HTML of a web page for Image Resources. It extracts the text (which will appear in a note initially set up on the image page) and codes it into a wiki-like markup, and it locates and identifies images suitable as Image Resources.

This class uses methods provided by org.w3c.tidy, for which thanks are hereby given.

Author:
John Bradley

Nested Class Summary
 class HtmlConverter.ImageData
           
 
Constructor Summary
HtmlConverter(InputStream in, URL theURL)
          this constructor takes an InputStream that reads the HTML page at the given URL and uses org.w3c.tidy to build a DOM of the page, which can subsequently be harvested for either its text or a list of its images.
 
Method Summary
 HtmlConverter.ImageData[] getImageData()
          fetches information about images that were found on the given HTML page.
 String getTextualContents()
          takes the DOM representation of the HTML page and converts the text found in it into a string of wiki-like markup.
 String getTitle()
          gets the text of the HTML title element.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

HtmlConverter

public HtmlConverter(InputStream in,
                     URL theURL)
this constructor takes an InputStream that reads the HTML page at the given URL and uses org.w3c.tidy to build a DOM of the page, which can subsequently be harvested for either its text or a list of its images.

Parameters:
in - an InputStream for the HTML page.
theURL - the URL to the HTML page.
Method Detail

getImageData

public HtmlConverter.ImageData[] getImageData()
fetches information about images that were found on the given HTML page.

Returns:
an array of ImageData elements.
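
The idea of locating images in a page's DOM can be illustrated with a minimal sketch. It is not the code of this class: the name ImageScanSketch and the FoundImage type are hypothetical stand-ins (the members of HtmlConverter.ImageData are not listed on this page), the JDK's XML parser replaces org.w3c.tidy and so accepts only well-formed XHTML, and java.net.URI is used where the constructor takes a URL. Which images count as "suitable" is also not specified here, so the sketch simply keeps every img element that has a src attribute.

```java
import java.io.ByteArrayInputStream;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class ImageScanSketch {

    /** Hypothetical stand-in for HtmlConverter.ImageData, whose members are not listed here. */
    static final class FoundImage {
        final URI uri;
        final String altText;

        FoundImage(URI uri, String altText) {
            this.uri = uri;
            this.altText = altText;
        }
    }

    /** Parses well-formed XHTML with the JDK parser (a stand-in for the DOM that Tidy builds). */
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XHTML", e);
        }
    }

    /** Collects each img element that has a src attribute, resolving src against the page address. */
    static FoundImage[] findImages(Document dom, URI pageUri) {
        List<FoundImage> found = new ArrayList<>();
        NodeList imgs = dom.getElementsByTagName("img");
        for (int i = 0; i < imgs.getLength(); i++) {
            Element img = (Element) imgs.item(i);
            String src = img.getAttribute("src"); // "" when the attribute is absent
            if (src.isEmpty()) {
                continue;
            }
            try {
                found.add(new FoundImage(pageUri.resolve(src), img.getAttribute("alt")));
            } catch (IllegalArgumentException e) {
                // Skip src values that are not valid URI references.
            }
        }
        return found.toArray(new FoundImage[0]);
    }

    public static void main(String[] args) {
        URI page = URI.create("http://example.org/pages/index.html");
        Document dom = parse(
                "<html><body><img src=\"img/a.png\" alt=\"A\"/><img alt=\"no source\"/></body></html>");
        for (FoundImage image : findImages(dom, page)) {
            System.out.println(image.uri + " (" + image.altText + ")");
        }
    }
}
```

Resolving each src against the page address is the reason the constructor needs the page's URL as well as its InputStream: relative image references cannot be turned into fetchable addresses without it.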

getTitle

public String getTitle()
gets the text of the HTML title element. If the page has no title element, returns an empty string.
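
The empty-string fallback can be sketched against a plain DOM as follows. This is an illustration under assumptions, not the code of this class: TitleSketch and titleOf are hypothetical names, the JDK's XML parser stands in for org.w3c.tidy (so the input must be well-formed XHTML), and trimming the surrounding whitespace is a choice made for the sketch.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

class TitleSketch {

    /** Parses well-formed XHTML with the JDK parser (a stand-in for the DOM that Tidy builds). */
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XHTML", e);
        }
    }

    /** Returns the text of the first title element, or "" when the page has none. */
    static String titleOf(Document dom) {
        NodeList titles = dom.getElementsByTagName("title");
        if (titles.getLength() == 0) {
            return "";
        }
        return titles.item(0).getTextContent().trim();
    }

    public static void main(String[] args) {
        Document dom = parse("<html><head><title>Pliny notes</title></head><body/></html>");
        System.out.println(titleOf(dom));
    }
}
```

Returning an empty string rather than null lets a caller use the result directly, for example as the initial name of a resource, without a null check.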


getTextualContents

public String getTextualContents()
takes the DOM representation of the HTML page and converts the text found in it into a string of wiki-like markup.
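
A DOM-to-markup conversion of this kind is typically a recursive walk over the tree. The sketch below shows the general shape only. The markup rules it applies (= ... = for a top-level heading, ''' and '' for bold and italic, * for list items) are assumptions borrowed from common wiki syntax, since the markup this method actually emits is not described on this page. WikiTextSketch and its methods are hypothetical names, and the JDK's XML parser stands in for org.w3c.tidy.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

class WikiTextSketch {

    /** Parses well-formed XHTML with the JDK parser (a stand-in for the DOM that Tidy builds). */
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XHTML", e);
        }
    }

    /** Converts the text under the given node to a wiki-like string (assumed markup rules). */
    static String toWikiText(Node root) {
        StringBuilder out = new StringBuilder();
        walk(root, out);
        return out.toString().trim();
    }

    private static void walk(Node node, StringBuilder out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            // Collapse runs of whitespace, as a browser would when rendering.
            out.append(node.getNodeValue().replaceAll("\\s+", " "));
            return;
        }
        switch (node.getNodeName().toLowerCase(Locale.ROOT)) {
            case "head": case "script": case "style":
                return; // not part of the visible text
            case "b": case "strong":
                out.append("'''"); walkChildren(node, out); out.append("'''");
                return;
            case "i": case "em":
                out.append("''"); walkChildren(node, out); out.append("''");
                return;
            case "h1":
                out.append("\n= "); walkChildren(node, out); out.append(" =\n");
                return;
            case "p":
                out.append("\n"); walkChildren(node, out); out.append("\n");
                return;
            case "li":
                out.append("\n* "); walkChildren(node, out);
                return;
            default:
                walkChildren(node, out); // other elements contribute only their text
        }
    }

    private static void walkChildren(Node node, StringBuilder out) {
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            walk(child, out);
        }
    }

    public static void main(String[] args) {
        Document dom = parse("<html><body><h1>Plan</h1><p>Some <b>bold</b> text.</p></body></html>");
        System.out.println(toWikiText(dom));
    }
}
```

Elements the sketch does not recognise fall through to the default branch, so their text is kept and only their formatting is lost.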