I’m trying to strip HTML from a string while keeping a few allowed tags.
I have found the following code:
    text = ""
    tokenizer = HTML::Tokenizer.new(html)
    while token = tokenizer.next
      node = HTML::Node.parse(nil, 0, 0, token, false)
      # result is only the content of any Text nodes
      text << node.to_s if node.class == HTML::Text
    end
    # strip any comments, and if they have a newline at the end (ie.
    # only a comment) strip that too
    html # already plain text
I’m trying to understand what is going on in this code, but I cannot find
documentation for HTML::Tokenizer or HTML::Node.parse. Does anyone know
what the parameters of the parse method are for?
In the while loop, how do you access the HTML tag? If I could access
the HTML tags, I could then decide whether to keep each tag or not.
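To make it concrete, here is a rough sketch of the behaviour I’m after. It does not use HTML::Tokenizer or HTML::Node at all, since I don’t know their API. Instead, a plain regex stands in for the tokenizer. The names ALLOWED_TAGS, tokenize and strip_tags_except are made up for this example, and the regex ignores edge cases such as a ">" inside an attribute value or a comment.

```ruby
# Hypothetical whitelist; the tag names are only examples.
ALLOWED_TAGS = %w[b i a].freeze

# Stand-in for the tokenizer: split the input into tag tokens and text tokens.
# This is a plain regex, not HTML::Tokenizer.
def tokenize(html)
  html.scan(/<[^>]*>|[^<]+/)
end

# Keep text tokens and whitelisted tags; drop every other tag.
def strip_tags_except(html, allowed = ALLOWED_TAGS)
  text = String.new
  tokenize(html).each do |token|
    if token.start_with?("<")
      # Pull the tag name out of "<a href=...>", "</a>", "<br/>" and so on.
      # Comments such as "<!-- x -->" yield no name and are dropped.
      name = token[/\A<\/?\s*([A-Za-z][A-Za-z0-9]*)/, 1]
      text << token if name && allowed.include?(name.downcase)
    else
      text << token
    end
  end
  text
end

puts strip_tags_except('<p>Hello <b>world</b>, <a href="/x">link</a></p>')
```

What I would like to know is how to make the same keep-or-drop decision inside the while loop above, using whatever the parsed node exposes for the tag name.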
Thanks for reading,