Extracting Data from a Webpage

Hello everyone.

I was wondering if anyone knew a way to extract the web page title off
of a specific URL that you input into a program?

I give it the URL, say www.google.com. It then gives me “Google”, its
title.

Then also, is there any way that the program could extract the next 5
characters after a certain phrase that doesn’t change on the webpage?

Thanks for your help in advance!

On Jan 26, 2008, at 7:21 PM, Tj Superfly wrote:

characters - after a certain phrase that doesn’t change on the
webpage?

Thanks for your help in advance!

Posted via http://www.ruby-forum.com/.

http://code.whytheluckystiff.net/hpricot/

It’s a snap.

Tj Superfly wrote:

Hello everyone.

I was wondering if anyone knew a way to extract the web page title off
of a specific URL that you input into a program?

I give it the URL, say www.google.com. It then gives me “Google”, its
title.

Then also, is there any way that the program could extract the next 5
characters after a certain phrase that doesn’t change on the webpage?

Thanks for your help in advance!

You can do something like this:

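require 'open-uri'

url = "http://www.google.com"

# Ruby 2.5+ spells this URI.open; in 2008 it was plain open(url).
URI.open(url) do |f|
  f.each do |line|
    # Print the text between <title> and </title> when both are on this line.
    if md_obj = /<title>(.*)<\/title>/.match(line)
      puts md_obj[1]
    end

    # Print the 6 characters that follow "type=".
    if md_obj = /type=(.{6})/.match(line)
      puts md_obj[1]
    end
  end
end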

Ruby also has various html parsing libraries that allow you to search
html documents by tag name, tag position, etc.

This should be more efficient:

require 'open-uri'

url = "http://www.google.com"
title_re = Regexp.new(/<title>(.*)<\/title>/)
text_re = Regexp.new(/type=(.{5})/)

# Ruby 2.5+ spells this URI.open; in 2008 it was plain open(url).
URI.open(url) do |f|
  f.each do |line|
    if md_obj = title_re.match(line)
      puts md_obj[1]
    end

    # Stop reading as soon as the phrase has been found.
    if md_obj = text_re.match(line)
      puts md_obj[1]
      break
    end
  end
end

--output:--
Google
hidde   # first 5 chars of 'hidden'

I receive this error message when trying this code.

DENTIFIER, expecting $end
endndndreakmd_obj[1]_re.match(line))/title>/)

Any suggestions? I did try the other clip of code posted here, but got
more errors than this one. =/ I’m reading up on that link posted in the
2nd post to see if I can figure any of this out.

Thanks.

Tj Superfly wrote:

I receive this error message when trying this code.

DENTIFIER, expecting $end
endndndreakmd_obj[1]_re.match(line))/title>/)

Any suggestions?

  1. Learn some basic ruby?

  2. Learn how to post a question on a computer programming forum?

On Jan 26, 9:21 pm, Tj Superfly wrote:

Hello everyone.

I was wondering if anyone knew a way to extract the web page title off
of a specific URL that you input into a program?

I give it the URL, say www.google.com. It then gives me “Google” - its
title.

"www.google.com"[/(www\.)?(.*)\./, 2].capitalize
==> "Google"
"google.com"[/(www\.)?(.*)\./, 2].capitalize
==> "Google"

Tj Superfly wrote:

7stud – wrote:

Tj Superfly wrote:

I receive this error message when trying this code.

DENTIFIER, expecting $end
endndndreakmd_obj[1]_re.match(line))/title>/)

Any suggestions?

  1. Learn some basic ruby?

  2. Learn how to post a question on a computer programming forum?

Anyone else know what the matter is?

How to post a question on a computer programming forum:

  1. Post a simple example program that demonstrates your problem.

  2. Post the error message in its entirety, not an unintelligible portion
    of it.

  3. Post your question about the code.

  4. Use a descriptive title for your post, not something like
    “URGENT…HELP ME!”

  5. Proofread and spell-check your post before clicking submit.

On Jan 27, 2008, at 4:57 PM, Tj Superfly wrote:

7stud – wrote:

Tj Superfly wrote:

I receive this error message when trying this code.

DENTIFIER, expecting $end
endndndreakmd_obj[1]_re.match(line))/title>/)

Any suggestions?

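I believe $end refers to the end of the input, NOT the ‘end’ keyword: the
parser expected the code to be finished and found something else. That
often points to an unbalanced delimiter, so check your {} pairs and the
/ / around each regexp.

Also, if you can, have your editor autoformat the file so you can see
where the indentation goes wrong.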
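
A quick way to test a suspect line for this kind of problem is to ask Ruby to parse it without running it. The sketch below is only an illustration: valid_syntax? is a made-up helper name, the two sample lines are assumptions rather than the poster's actual file, and RubyVM::InstructionSequence is specific to the standard (MRI) interpreter.

```ruby
# Hypothetical helper: true when +source+ parses as Ruby, false otherwise.
# RubyVM::InstructionSequence.compile parses and compiles without executing.
def valid_syntax?(source)
  RubyVM::InstructionSequence.compile(source)
  true
rescue SyntaxError
  false
end

# Assumed sample lines: the first leaves the "/" in </title> unescaped,
# which ends the regexp early; the second escapes it.
unescaped = 'title_re = /<title>(.*)</title>/'
escaped   = 'title_re = /<title>(.*)<\/title>/'

puts valid_syntax?(unescaped)  # false
puts valid_syntax?(escaped)    # true
```

For a whole file, running ruby -c yourfile.rb from the command line performs the same kind of syntax-only check.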

http://code.whytheluckystiff.net/hpricot/
It’s a snap.

I believe hpricot, as fine as it may be, is a little bit overkill for
such a task.

Ideally, a simple task should stay simple, or at least as simple as
possible.

I was wondering if anyone knew a way to extract the web page title off
of a specific URL that you input into a program?

require 'net/http'
puts Net::HTTP.new('www.google.com').get('/').
  body[/<title>(.*?)<.title>/i, 1]

7stud – wrote:

title_re = Regexp.new(/<title>(.*)<\/title>/)

That regex works for www.google.com, but to handle titles that span more
than one line it should be:

title_re = Regexp.new(/<title>(.*)<\/title>/m)

and then to output the match:

puts md_obj[1].strip()
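
One caveat with the /m flag: it lets . match newlines, but that only helps if the regex sees the opening and closing tags in the same string. When reading line by line, a title split over several lines still never matches. A minimal sketch, assuming the page is already held in a string (the sample HTML and the html and per_line names are made up for illustration):

```ruby
# Made-up sample page whose title is spread over three lines.
html = "<html><head><title>\n  Example Page\n</title></head><body></body></html>"

title_re = /<title>(.*)<\/title>/m

# Line by line: no single line holds both tags, so nothing matches.
per_line = html.each_line.map { |line| title_re.match(line) }.compact
puts per_line.empty?            # true

# Whole document: /m lets (.*) run across the newlines.
md_obj = title_re.match(html)
puts md_obj[1].strip            # Example Page
```

With open-uri, calling read on the opened object instead of iterating with each should give the whole body as one string to match against.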

7stud – wrote:

Nice. I tested your code and it works for me. But my reading of the
docs says that it shouldn’t work: new() doesn’t open a connection, and
get(), “Gets data from path on the connected-to host.” The docs seem
to want you to do something like:

resp_obj = Net::HTTP.get_response('www.google.com', '/index.html')
page = resp_obj.body

Reading the Net::HTTP docs here:
http://ruby-doc.org/stdlib/libdoc/net/http/rdoc/classes/Net/HTTP.html#M001127

It says that Net::HTTP#get will:
“Send a GET request to the target and return the response as a string”

and Net::HTTP#get_response will:
“Send a GET request to the target and return the response as a
Net::HTTPResponse object”

Is #new optional in this case because both methods exist as class methods
as well as instance methods? Someone might be able to clarify this part a
little more. But the examples at that doc URL don’t even use
Net::HTTP#new.

William J. wrote:

I was wondering if anyone knew a way to extract the web page title off
of a specific URL that you input into a program?

require 'net/http'
puts Net::HTTP.new('www.google.com').get('/').
  body[/<title>(.*?)<.title>/i, 1]

Nice. I tested your code and it works for me. But my reading of the
docs says that it shouldn’t work: new() doesn’t open a connection, and
get(), “Gets data from path on the connected-to host.” The docs seem
to want you to do something like:

resp_obj = Net::HTTP.get_response('www.google.com', '/index.html')
page = resp_obj.body

According to the docs, Net::HTTP has class methods:

get()
get_response()

and an instance method:

get()

As with all ruby classes, new() creates an instance. Therefore, in the
code example I was wondering about:

puts Net::HTTP.new('www.google.com').get('/').
  body[/<title>(.*?)<.title>/i, 1]

new() creates an instance, which is being used to call get(), so the
version of get() being called is the instance method. Yet the docs say
the get() instance method “Gets data from path on the connected-to
host”. What connected-to host? The docs on new() say, “This method does
not open the TCP connection.”

In addition, the get() version in that code cannot be the class method
version because the class method version returns a String and Strings do
not have a body() method, which is the next method call.
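
The two kinds of get() can be checked from irb without touching the network. This is only a sketch of the distinction described above, using standard Ruby introspection calls rather than anything specific to Net::HTTP:

```ruby
require 'net/http'

# Class-level get: called as Net::HTTP.get(...), documented to return a String.
puts Net::HTTP.respond_to?(:get)        # true

# Instance-level get: called on the object that Net::HTTP.new returns.
puts Net::HTTP.method_defined?(:get)    # true

# String has no body method, so a .body call cannot be chained onto
# the result of the class-level get.
puts String.method_defined?(:body)      # false
```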

As far as I can tell, you should have to call start() on a Net::HTTP
instance in order to open a connection, e.g.:

str = Net::HTTP.new('www.google.com').start().get('/').body
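
To see what actually happens, here is a rough experiment that runs entirely on the local machine. Everything about the setup is an assumption for illustration: the throwaway TCPServer, the sample page, and the variable names are not from the thread. As far as I know, Net::HTTP opens the connection on demand when a request is made on an instance that has not been started, which would explain why the code without start() works.

```ruby
require 'net/http'
require 'socket'

# Throwaway local server (illustrative) so the example needs no internet access.
page   = "<html><head><title>Example</title></head><body></body></html>"
server = TCPServer.new('127.0.0.1', 0)   # port 0 = let the OS pick a free port
port   = server.addr[1]

server_thread = Thread.new do
  2.times do                              # one connection per request below
    client = server.accept
    # Read and discard the request line and headers up to the blank line.
    while (line = client.gets) && line != "\r\n"
    end
    client.write "HTTP/1.1 200 OK\r\n" \
                 "Content-Type: text/html\r\n" \
                 "Content-Length: #{page.bytesize}\r\n" \
                 "Connection: close\r\n\r\n" + page
    client.close
  end
end

http = Net::HTTP.new('127.0.0.1', port)
puts http.started?                        # false: new() has not connected

resp = http.get('/')                      # no start() call, yet this succeeds
puts resp.class                           # Net::HTTPOK
puts resp.body[/<title>(.*?)<.title>/i, 1]   # Example

str = Net::HTTP.get('127.0.0.1', '/', port)  # class method returns the body
puts str.class                            # String

server_thread.join
server.close
```

Calling start() explicitly still makes sense when several requests should share one connection, since it keeps the session open until finish is called or the start block ends.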