https://github.com/jhy/jsoup
https://github.com/code4craft/jsoup-learning
http://my.oschina.net/flashsword/blog/156748
public String getPlainText(Element element) {
FormattingVisitor formatter = new FormattingVisitor();
NodeTraversor traversor = new NodeTraversor(formatter);
traversor.traverse(element); // walk the DOM, and call .head() and .tail() for each node
return formatter.toString();
}
// the formatting rules, implemented in a breadth-first DOM traverse
private class FormattingVisitor implements NodeVisitor {
private static final int maxWidth = 80;
private int width = 0;
private StringBuilder accum = new StringBuilder(); // holds the accumulated text
// hit when the node is first seen
public void head(Node node, int depth) {
String name = node.nodeName();
if (node instanceof TextNode)
append(((TextNode) node).text()); // TextNodes carry all user-readable text in the DOM.
else if (name.equals("li"))
append("\n * ");
else if (name.equals("dt"))
append(" ");
else if (StringUtil.in(name, "p", "h1", "h2", "h3", "h4", "h5", "tr"))
append("\n");
}
// hit when all of the node's children (if any) have been visited
public void tail(Node node, int depth) {
String name = node.nodeName();
if (StringUtil.in(name, "br", "dd", "dt", "p", "h1", "h2", "h3", "h4", "h5"))
append("\n");
else if (name.equals("a"))
append(String.format(" <%s>", node.absUrl("href")));
}
// appends text to the string builder with a simple word wrap method
private void append(String text) {
if (text.startsWith("\n"))
width = 0; // reset counter if starts with a newline. only from formats above, not in natural text
if (text.equals(" ") &&
(accum.length() == 0 || StringUtil.in(accum.substring(accum.length() - 1), " ", "\n")))
return; // don't accumulate long runs of empty spaces
if (text.length() + width > maxWidth) { // won't fit, needs to wrap
String words[] = text.split("\\s+");
for (int i = 0; i < words.length; i++) {
String word = words[i];
boolean last = i == words.length - 1;
if (!last) // insert a space if not the last word
word = word + " ";
if (word.length() + width > maxWidth) { // wrap and reset counter
accum.append("\n").append(word);
width = word.length();
} else {
accum.append(word);
width += word.length();
}
}
} else { // fits as is, without need to wrap text
accum.append(text);
width += text.length();
}
}
}
http://my.oschina.net/flashsword/blog/156798
https://github.com/code4craft/jsoup-learning
http://my.oschina.net/flashsword/blog/156748
jsoup
├── examples #样例,包括一个将html转为纯文本和一个抽取所有链接地址的例子。
├── helper #一些工具类,包括读取数据、处理连接以及字符串转换的工具
├── nodes #DOM节点定义
├── parser #解析html并转换为DOM树
├── safety #安全相关,包括白名单及html过滤
└── select #选择器,支持CSS Selector以及NodeVisitor格式的遍历
Jsoup的入口是
Jsoup
类。examples包里提供了两个例子,解析html后,分别用CSS Selector以及NodeVisitor来操作Dom元素。
这里用
Document doc = Jsoup.connect(url).get();
// 使用select方法选择元素,参数是CSS Selector表达式
Elements links = doc.select("a[href]");
print("\nLinks: (%d)", links.size());
for (Element link : links) {
//使用abs:前缀取绝对url地址
print(" * a: <%s> (%s)", link.attr("abs:href"), trim(link.text(), 35));
}
ListLinks
里的例子来说明如何调用Jsoup:
Jsoup使用了自己的一套DOM代码体系,这里的Elements、Element等虽然名字和概念都与Java XML API
org.w3c.dom
类似,但并没有代码层面的关系。就是说你想用XML的一套API来操作Jsoup的结果是办不到的,但是正因为如此,才使得Jsoup可以抛弃xml里一些繁琐的API,使得代码更加简单。
还有一种方式是通过
NodeVisitor
来遍历DOM树,这个在对整个html做分析和替换时比较有用:public interface NodeVisitor {
//遍历到节点开始时,调用此方法
public void head(Node node, int depth);
//遍历到节点结束时(所有子节点都已遍历完),调用此方法
public void tail(Node node, int depth);
}
FormattingVisitor formatter = new FormattingVisitor();
NodeTraversor traversor = new NodeTraversor(formatter);
traversor.traverse(element); // walk the DOM, and call .head() and .tail() for each node
return formatter.toString();
}
// the formatting rules, implemented in a breadth-first DOM traverse
private class FormattingVisitor implements NodeVisitor {
private static final int maxWidth = 80;
private int width = 0;
private StringBuilder accum = new StringBuilder(); // holds the accumulated text
// hit when the node is first seen
public void head(Node node, int depth) {
String name = node.nodeName();
if (node instanceof TextNode)
append(((TextNode) node).text()); // TextNodes carry all user-readable text in the DOM.
else if (name.equals("li"))
append("\n * ");
else if (name.equals("dt"))
append(" ");
else if (StringUtil.in(name, "p", "h1", "h2", "h3", "h4", "h5", "tr"))
append("\n");
}
// hit when all of the node's children (if any) have been visited
public void tail(Node node, int depth) {
String name = node.nodeName();
if (StringUtil.in(name, "br", "dd", "dt", "p", "h1", "h2", "h3", "h4", "h5"))
append("\n");
else if (name.equals("a"))
append(String.format(" <%s>", node.absUrl("href")));
}
// appends text to the string builder with a simple word wrap method
private void append(String text) {
if (text.startsWith("\n"))
width = 0; // reset counter if starts with a newline. only from formats above, not in natural text
if (text.equals(" ") &&
(accum.length() == 0 || StringUtil.in(accum.substring(accum.length() - 1), " ", "\n")))
return; // don't accumulate long runs of empty spaces
if (text.length() + width > maxWidth) { // won't fit, needs to wrap
String words[] = text.split("\\s+");
for (int i = 0; i < words.length; i++) {
String word = words[i];
boolean last = i == words.length - 1;
if (!last) // insert a space if not the last word
word = word + " ";
if (word.length() + width > maxWidth) { // wrap and reset counter
accum.append("\n").append(word);
width = word.length();
} else {
accum.append(word);
width += word.length();
}
}
} else { // fits as is, without need to wrap text
accum.append(text);
width += text.length();
}
}
}
http://my.oschina.net/flashsword/blog/156798
我们先来看看nodes包的类图:
这里可以看到,核心无疑是
Node
类。
Node类是一个抽象类,它代表DOM树中的一个节点,它包含:
- 父节点
parentNode
以及子节点childNodes
的引用 - 属性值集合
attributes
- 页面的uri
baseUri
,用于修正相对地址为绝对地址 - 在兄弟节点中的位置
siblingIndex
,用于进行DOM操作
Node里面包含一些获取属性、父子节点、修改元素的方法,其中比较有意思的是
absUrl()
。我们知道,在很多html页面里,链接会使用相对地址,我们有时会需要将其转变为绝对地址。Jsoup的解决方案是在attr()的参数开始加"abs:“,例如attr(“abs:href”),而absUrl()
就是其实现方式。URL base;
try {
try {
base = new URL(baseUri);
} catch (MalformedURLException e) {
// the base is unsuitable, but the attribute may be abs on its own, so try that
URL abs = new URL(relUrl);
return abs.toExternalForm();
}
// workaround: java resolves '//path/file + ?foo' to '//path/?foo', not '//path/file?foo' as desired
if (relUrl.startsWith("?"))
relUrl = base.getPath() + relUrl;
// java URL自带的相对路径解析
URL abs = new URL(base, relUrl);
return abs.toExternalForm();
} catch (MalformedURLException e) {
return "";
}
Node还有一个比较值得一提的方法是
abstract String nodeName()
,这个相当于定义了节点的类型名(例如Document
是'#Document',Element
则是对应的TagName)。
Element也是一个重要的类,它代表的是一个HTML元素。它包含一个字段
tag
和classNames
。classNames是"class"属性解析出来的集合,因为CSS规范里,“class"属性允许设置多个,并用空格隔开,而在用Selector选择的时候,即使只指定其中一个,也能够选中其中的元素。所以这里就把"class"属性展开了。Element还有选取元素的入口,例如select
、getElementByXXX
,这些都用到了select包中的内容,这个留到下篇文章select再说。
Document是代表整个文档,它也是一个特殊的Element,即根节点。Document除了Element的内容,还包括一些输出的方法。
Document还有一个属性
quirksMode
,大致意思是定义处理非标准HTML的几个级别,这个留到以后分析parser的时候再说。
Node还有一些方法,例如
outerHtml()
,用作节点及文档HTML的输出,用到了树的遍历。在DOM树的遍历上,用到了NodeVisitor
和NodeTraversor
来对树的进行遍历。NodeVisitor
在上一篇文章提到过了,head()和tail()分别是遍历开始和结束时的方法,而NodeTraversor
的核心代码如下:public void traverse(Node root) {
Node node = root;
int depth = 0;
//这里对树进行后序(深度优先)遍历
while (node != null) {
//开始遍历node
visitor.head(node, depth);
if (node.childNodeSize() > 0) {
node = node.childNode(0);
depth++;
} else {
//没有下一个兄弟节点,退栈
while (node.nextSibling() == null && depth > 0) {
visitor.tail(node, depth);
node = node.parent();
depth--;
}
//结束遍历
visitor.tail(node, depth);
if (node == root)
break;
node = node.nextSibling();
}
}
}
这里使用循环+回溯来替换掉了我们常用的递归方式,从而避免了栈溢出的风险。
实际上,Jsoup的Selector机制也是基于
http://my.oschina.net/flashsword/blog/157031NodeVisitor
来实现的,可以说NodeVisitor
是更加底层和灵活的API。
分析代码前,我们不妨先想想,“tidy HTML"到底包括哪些东西:
- 换行,块级标签习惯上都会独占一行
- 缩进,根据HTML标签嵌套层数,行首缩进会不同
- 严格的标签闭合,如果是可以自闭合的标签并且没有内容,则进行自闭合
- HTML实体的转义
这里要补充一下HTML标签的知识。HTML Tag可以分为block和inline两类。关于Tag的inline和block的定义可以参考http://www.w3schools.com/html/html_blocks.asp,而Jsoup的
Tag
类则是对Java开发者非常好的学习资料。//block tags,需要换行
private static final String[] blockTags = {
"html", "head", "body", "frameset", "script", "noscript", "style", "meta", "link", "title", "frame",
"noframes", "section", "nav", "aside", "hgroup", "header", "footer", "p", "h1", "h2", "h3", "h4", "h5", "h6",
"ul", "ol", "pre", "div", "blockquote", "hr", "address", "figure", "figcaption", "form", "fieldset", "ins",
"del", "s", "dl", "dt", "dd", "li", "table", "caption", "thead", "tfoot", "tbody", "colgroup", "col", "tr", "th",
"td", "video", "audio", "canvas", "details", "menu", "plaintext"
};
//inline tags,无需换行
private static final String[] inlineTags = {
"object", "base", "font", "tt", "i", "b", "u", "big", "small", "em", "strong", "dfn", "code", "samp", "kbd",
"var", "cite", "abbr", "time", "acronym", "mark", "ruby", "rt", "rp", "a", "img", "br", "wbr", "map", "q",
"sub", "sup", "bdo", "iframe", "embed", "span", "input", "select", "textarea", "label", "button", "optgroup",
"option", "legend", "datalist", "keygen", "output", "progress", "meter", "area", "param", "source", "track",
"summary", "command", "device"
};
//emptyTags是不能有内容的标签,这类标签都是可以自闭合的
private static final String[] emptyTags = {
"meta", "link", "base", "frame", "img", "br", "wbr", "embed", "hr", "input", "keygen", "col", "command",
"device"
};
private static final String[] formatAsInlineTags = {
"title", "a", "p", "h1", "h2", "h3", "h4", "h5", "h6", "pre", "address", "li", "th", "td", "script", "style",
"ins", "del", "s"
};
//在这些标签里,需要保留空格
private static final String[] preserveWhitespaceTags = {
"pre", "plaintext", "title", "textarea"
};
另外,Jsoup的
Entities
类里包含了一些HTML实体转义的东西。这些转义的对应数据保存在entities-full.properties
和entities-base.properties
里。