Extracting Data from a Webpage

superfly · January 27, 2008, 4:21am

Hello everyone.

I was wondering if anyone knew a way to extract the web page title off
of a specific URL that you input into a program?

I give it the URL, say www.google.com. It then gives me “Google” - its
title.

Then also, is there any way that the program could extract the next 5
characters - after a certain phrase that doesn’t change on the webpage?

Thanks for your help in advance!

superfly · January 27, 2008, 5:09am

http://code.whytheluckystiff.net/hpricot/

It’s a snap.

superfly · January 27, 2008, 5:42am

You can do something like this:

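require 'open-uri'

url = "http://www.google.com"

# On Ruby 2.5+ use URI.open; older versions used plain open(url).
URI.open(url) do |f|
  f.each do |line|
    # Print the page title if this line contains it.
    if md_obj = /<title>(.*)<\/title>/.match(line)
      puts md_obj[1]
    end

    # Print the 6 characters that follow "type=".
    if md_obj = /type=(.{6})/.match(line)
      puts md_obj[1]
    end
  end
end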

Ruby also has various html parsing libraries that allow you to search
html documents by tag name, tag position, etc.

superfly · January 27, 2008, 6:21am

This should be more efficient:

require 'open-uri'

url = "http://www.google.com"
title_re = Regexp.new(/<title>(.*)<\/title>/)
text_re = Regexp.new(/type=(.{5})/)

# On Ruby 2.5+ use URI.open; older versions used plain open(url).
URI.open(url) do |f|
  f.each do |line|
    if md_obj = title_re.match(line)
      puts md_obj[1]
    end

    if md_obj = text_re.match(line)
      puts md_obj[1]
      break   # stop reading after the first "type=" match
    end
  end
end

--output:
Google
hidde   # first 5 chars of 'hidden'
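
As a rough sketch of the same two lookups, the helpers below work on an HTML string instead of a live page, so they can be tried offline. The names extract_title and chars_after, and the sample HTML, are made up for illustration; they do not come from any library.

```ruby
# Hypothetical helpers; names and sample HTML are illustrative only.

# Returns the text between <title> and </title>, or nil if there is none.
# /m lets "." match newlines, /i ignores the case of the tag name.
def extract_title(html)
  match = %r{<title[^>]*>(.*?)</title>}mi.match(html)
  match && match[1].strip
end

# Returns the n characters that follow the first occurrence of phrase,
# or nil if the phrase is not found.
def chars_after(html, phrase, n = 5)
  index = html.index(phrase)
  index && html[index + phrase.length, n]
end

sample = '<html><head><title>Google</title></head>' \
         '<body><input type=hidden name=q></body></html>'

puts extract_title(sample)          # Google
puts chars_after(sample, 'type=')   # hidde
```

Working on the whole page as one string also avoids depending on where the line breaks happen to fall in the HTML.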

superfly · January 27, 2008, 5:34pm

I receive this error message when trying this code.

DENTIFIER, expecting $end
endndndreakmd_obj[1]_re.match(line))/title>/)

Any suggestions? I did try the other clip of code posted here, but got
more errors than this one. =/ I’m reading up on that link posted in the
2nd post to see if I can figure any of this out.

Thanks.

superfly · January 27, 2008, 9:25pm

Tj Superfly wrote:

I receive this error message when trying this code.

DENTIFIER, expecting $end
endndndreakmd_obj[1]_re.match(line))/title>/)

Any suggestions?

Learn some basic ruby?
Learn how to post a question on a computer programming forum?

superfly · January 27, 2008, 8:50am

On Jan 26, 9:21 pm, Tj Superfly wrote:

Hello everyone.

I was wondering if anyone knew a way to extract the web page title off
of a specific URL that you input into a program?

I give it the URL, say www.google.com. It then gives me “Google” - its
title.

"www.google.com"[/(www\.)?(.*)\./, 2].capitalize
==> "Google"
"google.com"[/(www\.)?(.*)\./, 2].capitalize
==> "Google"

superfly · January 27, 2008, 11:26pm

Tj Superfly wrote:

Anyone else know what the matter is?

How to post a question on a computer programming forum:

1. Post a simple example program that demonstrates your problem.
2. Post the error message in its entirety, not an unintelligible portion
   of it.
3. Post your question about the code.
4. Use a descriptive title for your post, not something like
   “URGENT…HELP ME!”
5. Proofread and spell-check your post before clicking submit.

superfly · January 27, 2008, 11:50pm

On Jan 27, 2008, at 4:57 PM, Tj Superfly wrote:

7stud – wrote:

Tj Superfly wrote:

I receive this error message when trying this code.

DENTIFIER, expecting $end
endndndreakmd_obj[1]_re.match(line))/title>/)

Any suggestions?

I believe that $end means you’re missing some sort of closing delimiter,
but NOT ‘end’. Check for an unbalanced {} or an unterminated / / in a
regexp.

Also, if you can, have your editor auto-format the code so you can
see where the indentation goes wrong.

superfly · January 28, 2008, 12:16am

http://code.whytheluckystiff.net/hpricot/
It’s a snap.

I believe hpricot, as fine as it may be, is a bit of overkill for
such a task.

A simple task should remain simple, or at least as simple as
possible.

superfly · January 27, 2008, 10:57pm

7stud – wrote:

Tj Superfly wrote:

I receive this error message when trying this code.

DENTIFIER, expecting $end
endndndreakmd_obj[1]_re.match(line))/title>/)

Any suggestions?

Learn some basic ruby?

Learn how to post a question on a computer programming forum?

Anyone else know what the matter is?

superfly · January 28, 2008, 3:00am

I was wondering if anyone knew a way to extract the web page title off
of a specific URL that you input into a program?

require 'net/http'
puts Net::HTTP.new('www.google.com').get('/').
  body[/<title>(.*?)<.title>/i, 1]

superfly · January 28, 2008, 2:39am

7stud – wrote:

title_re = Regexp.new(/<title>(.*)<\/title>/)

While that regex works for www.google.com, a more general version should
also handle titles that contain newlines, so the regex should use the /m
flag:

title_re = Regexp.new(/<title>(.*)<\/title>/m)

and then to output the match:

puts md_obj[1].strip()
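
One caveat, sketched here with a made-up HTML string: the /m flag only helps when the regex sees the whole document at once. When iterating line by line, a title that spans several lines never appears within a single line, so the match still fails.

```ruby
# Sample page (invented for illustration) whose title spans several lines.
page = "<html><head><title>\n  My Page\n</title></head></html>\n"

title_re = /<title>(.*)<\/title>/m

# Line by line: no single line holds both tags, so nothing matches.
per_line = page.each_line.map { |line| title_re.match(line) }.compact
puts per_line.empty?            # true

# Whole document: /m lets "." cross the newlines.
md_obj = title_re.match(page)
puts md_obj[1].strip            # My Page
```

With open-uri, the whole page can be read into one string by calling f.read inside the block instead of f.each.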

superfly · January 28, 2008, 5:42am

7stud – wrote:

Nice. I tested your code and it works for me. But my reading of the
docs says that it shouldn’t work: new() doesn’t open a connection, and
get(), “Gets data from path on the connected-to host.” The docs seem
to want you to do something like:

resp_obj = Net::HTTP.get_response('www.google.com',
                                  '/index.html')
page = resp_obj.body

Reading the Net::HTTP docs here:
http://ruby-doc.org/stdlib/libdoc/net/http/rdoc/classes/Net/HTTP.html#M001127

It says that Net::HTTP#get will:
“Send a GET request to the target and return the response as a string”

and Net::HTTP#get_response will:
“Send a GET request to the target and return the response as a
Net::HTTPResponse object”

Is #new optional in this case because both methods are class methods,
or are they instance methods? Someone might be able to clarify this part a
little more. But the examples at that doc URL don’t even use
Net::HTTP.new.

superfly · January 28, 2008, 5:24am

William J. wrote:

I was wondering if anyone knew a way to extract the web page title off
of a specific URL that you input into a program?

require 'net/http'
puts Net::HTTP.new('www.google.com').get('/').
  body[/<title>(.*?)<.title>/i, 1]

Nice. I tested your code and it works for me. But my reading of the
docs says that it shouldn’t work: new() doesn’t open a connection, and
get(), “Gets data from path on the connected-to host.” The docs seem
to want you to do something like:

resp_obj = Net::HTTP.get_response('www.google.com',
                                  '/index.html')
page = resp_obj.body

superfly · January 28, 2008, 10:58am

Joseph P. wrote:

Reading the Net::HTTP docs here:
http://ruby-doc.org/stdlib/libdoc/net/http/rdoc/classes/Net/HTTP.html#M001127

It says that Net::HTTP#get will:
“Send a GET request to the target and return the response as a string”

and Net::HTTP#get_response will:
“Send a GET request to the target and return the response as a
Net::HTTPResponse object”

Is #new optional in this case because both methods are class methods,
or are they instance methods?

According to the docs, Net::HTTP has class methods:

get()
get_response()

and an instance method:

get()

As with all ruby classes, new() creates an instance. Therefore, in the
code example I was wondering about:

puts Net::HTTP.new('www.google.com').get('/').
  body[/<title>(.*?)<.title>/i, 1]

new() creates an instance, which is being used to call get(), so the
version of get() being called is the instance method. Yet, the docs say
the get() instance method “Gets data from path on the connected-to
host”. What connected-to host? The docs on new() say,
“This method does not open the TCP connection.”

In addition, the get() version in that code cannot be the class method
version because the class method version returns a String and Strings do
not have a body() method, which is the next method call.
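
For what it is worth, the class/instance split can be checked from Ruby itself. This is only a quick sketch using standard reflection calls; it makes no network request.

```ruby
require 'net/http'

# Net::HTTP defines get both on the class and on its instances.
puts Net::HTTP.respond_to?(:get)            # true  (class method)
puts Net::HTTP.method_defined?(:get)        # true  (instance method)
puts Net::HTTP.respond_to?(:get_response)   # true  (class method)
```

So Net::HTTP.get(...) and Net::HTTP.new(...).get(...) call two different methods that happen to share a name.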

Someone might be able to clarify this part a
little more. But the examples at that doc URL don’t even use
Net::HTTP.new.

superfly · January 28, 2008, 11:08am

As far as I can tell, you should have to call start() on a Net::HTTP
instance in order to open a connection, e.g.:

str = Net::HTTP.new('www.google.com').start().get('/').body
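
A hedged sketch of opening the connection explicitly: the block form of Net::HTTP.start opens the connection, yields the started object, and closes the connection when the block ends. The helper name fetch_body is hypothetical. This may also explain why the earlier one-liner works: in current versions of net/http, sending a request on an instance that has not been started opens the connection on demand for that one request.

```ruby
require 'net/http'

# Hypothetical helper: fetch the body of http://host/path.
def fetch_body(host, path = '/')
  # start opens the TCP connection, yields the started object,
  # and closes the connection when the block returns.
  Net::HTTP.start(host) do |http|
    http.get(path).body
  end
end

# new() alone does not connect:
http = Net::HTTP.new('www.google.com')
puts http.started?   # false
```

Calling fetch_body('www.google.com') needs network access, so it is left out of the snippet itself.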