Previously I posted a topic asking how to strip all HTML tags and get
the remaining text using a regexp. Luckily I got one. This is the regexp:
With it I'm able to get all the text between the HTML tags, but there is
one small problem. I'm getting output like this:
Example Web Page
You have reached this web page by typing "example.com",
or "example.org" into your web browser.
These domain names are reserved for use in documentation and are not
for registration. See RFC
2606, Section 3.
This is the output I get when I parse the HTML content of
example.com using the above regexp. Here you can see some whitespace
between the pieces of text (i.e. between ‘Example Web Page’ and ‘You have
reached…’). This whitespace is left behind in place of the HTML tags
that I removed using the above regexp, and I want to remove it as well.
I think that modifying the above regexp would give me the right output
without the extra whitespace. Can somebody please help me?
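
For illustration, here is a rough sketch of the kind of result I'm after, written in Python. The regexp I was given is not reproduced here, so the tag pattern `<[^>]+>` below is only an assumed stand-in, and the function name `strip_tags_and_whitespace` is made up for this example. The idea is to remove the tags first and then collapse the leftover whitespace in a second pass, rather than doing everything in one regexp:

```python
import re

def strip_tags_and_whitespace(html):
    """Remove HTML tags, then collapse the whitespace they leave behind."""
    # Assumed stand-in for the tag-stripping regexp: replace each tag
    # with a single space so neighbouring words do not run together.
    text = re.sub(r"<[^>]+>", " ", html)
    # Collapse every run of spaces, tabs and newlines into one space,
    # then trim the leading and trailing space.
    return re.sub(r"\s+", " ", text).strip()

sample = "<h1>Example Web Page</h1>\n\n<p>You have reached this web page</p>"
print(strip_tags_and_whitespace(sample))
```

This puts everything on one line. A simple tag pattern like this one would probably still trip over comments, scripts, or a `>` inside an attribute value, so it is only meant to show the whitespace clean-up step.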