Nokogiri/ruby and troublesome characters in url

I’m very new to using ruby, and I can’t seem to figure something out
(that is probably quite basic). Any help is much appreciated!

When using Nokogiri and open-uri in Ruby, I define a constant containing
a partial URL (INITIAL_URL =
“https://zoek.officielebekendmakingen.nl/zoeken/resultaat/?zkt=Uitgebreid&pst=ParlementaireDocumenten”)
so that I can append to it later when requesting further pages (the full
code is below).

However, I keep running into errors:

  • “syntax error, unexpected tLABEL”
  • “unknown regexp options - zk”
  • “syntax error, unexpected ‘?’”

How can I fix this?

Here’s the full code:

require ‘Nokogiri’
require ‘open-uri’

def get_search_result_links(n_page)
  links = n_page.css(‘.linker-kolom li a’)
  puts “** There were #{links.length} links found”
  links.each do |link|
    href = link[‘href’]
    inner_url = ‘https://zoek.officielebekendmakingen.nl’ + href
    puts “\n\n\nFetching page at #{File.basename(inner_url).split(‘?’)[0]}”

    datalezer = open(inner_url).read
    lokalenieuwefilenaam = href + “.html”
    lokalenieuwefile = open(lokalenieuwefilenaam, “w”)
    lokalenieuwefile.write(datalezer)
    lokalenieuwefile.close
  end
end

INITIAL_URL =
https://zoek.officielebekendmakingen.nl/zoeken/resultaat/?zkt=Uitgebreid&pst=ParlementaireDocumenten
initial_page = Nokogiri::HTML(open(INITIAL_URL))
pagination_links = initial_page.css(‘.paginering.beneden a’)
last_page_link = pagination_links[-2]
last_page_number = last_page_link.text.to_i
(5…last_page_number).each do |page_num|
  puts “\n\n\n***** Getting page #{page_num}”
  results_page_url = “#{INITIAL_URL}&_page=#{page_num}”
  results_page = Nokogiri::HTML(open(results_page_url))
  get_search_result_links(results_page)
end

(In my setup) the line…

pagination_links = initial_page.css(‘.paginering.beneden a’)

returns an empty Nokogiri::XML::NodeSet => []

Which part of your HTML are you trying to select?

Something I found by googling: Parsing HTML with Nokogiri | The Bastards Book of Ruby

Abinoam Jr.

Thanks for the reply Abinoam.

With pagination_links = initial_page.css(‘.paginering.beneden a’) I’m
trying to recover

and then , which refer to all the page links. So apparently something
is going wrong here as well?

The bigger problem I’m dealing with is that Ruby treats the characters
following the question mark (in INITIAL_URL =
“https://zoek.officielebekendmakingen.nl/zoeken/resultaat/?zkt=Uitgebreid&pst=ParlementaireDocumenten”)
as code instead of as part of the string. So I get an error when simply
trying to assign a URL string to INITIAL_URL, because some of the
characters in the URL are interpreted as code.
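
One thing that can produce this kind of error, assuming the source file really contains the typographic quotes shown in the listing above: Ruby only treats the plain ASCII characters ' and " as string delimiters. Curly quotes do not start a string, so whatever sits between them is parsed as code. Below is a minimal sketch that swaps them for ASCII quotes. The helper name straighten_quotes and the SMART_QUOTES table are made up for this example.

```ruby
# Map each typographic quote to the ASCII quote Ruby recognises.
SMART_QUOTES = {
  '‘' => "'", '’' => "'",
  '“' => '"', '”' => '"'
}.freeze

# Hypothetical helper: returns a copy of the source text with ASCII quotes.
def straighten_quotes(source)
  source.gsub(/[‘’“”]/, SMART_QUOTES)
end

broken = 'require ‘open-uri’'
puts straighten_quotes(broken)
```

Running pasted code through something like this (or simply retyping the quotes in a plain-text editor) should leave string literals that Ruby can parse.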

Dear Sybren,

I’ve indented and fixed some quotes on your code.

It runs, but there’s no element with the classes “paginering beneden” in
the HTML it retrieves.
So, the code fails at “pagination_links =
initial_page.css(‘.paginering.beneden a’)”

Look:

initial_page.css(‘.paginering’) => []
initial_page.css(‘.beneden’) => []

But, as an example…
initial_page.css(‘.tekst-kleiner’)
initial_page.css(‘a.tekst-kleiner’)
initial_page.css(‘header’).css(‘a.tekst-kleiner’)

all return…
=> [#<Nokogiri::XML::Element:0x109e3d0 name=“a”
attributes=[#<Nokogiri::XML::Attr:0x109e358 name=“href”
value=“https://zoek.officielebekendmakingen.nl/zoeken/resultaat/?zkt=Uitgebreid&pst=ParlementaireDocumenten&grootte=2”>,
#<Nokogiri::XML::Attr:0x109e344 name=“class” value=“tekst-kleiner”>,
#<Nokogiri::XML::Attr:0x109e308 name=“title” value=“Schermteksten
verkleinen”>] children=[#<Nokogiri::XML::Text:0x10a2854 “”>]>]

Look at the HTML source of your URL and you will see it.

Best regards,
Abinoam Jr.

Are you testing your code by inserting an href by hand, something like
this:

inner_url = 'https://zoek.officielebekendmakingen.nl' + 

/something/zk?x=10&y=5

That produces the error:

unknown regexp options - zk

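The reason for that error is that /something/ is the syntax for a regex
literal, so Ruby reads the letters right after the closing slash (zk) as
regexp options, which they are not. Put the path in quotes so that it is
a string.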
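
To make the difference concrete, here is a small sketch that checks whether a line of Ruby parses, first with the path unquoted and then with it in ASCII quotes. The helper name parses? is invented for this example, and the path is the made-up one from above.

```ruby
# Hypothetical helper: true if Ruby can parse (and run) the given source.
def parses?(source)
  eval(source)
  true
rescue SyntaxError
  false
end

# Unquoted: /something/ starts a Regexp literal, so this does not parse.
unquoted = "inner_url = 'https://zoek.officielebekendmakingen.nl' + /something/zk?x=10&y=5"

# Quoted: the path is an ordinary String, so "?" and "&" are plain characters.
quoted = "inner_url = 'https://zoek.officielebekendmakingen.nl' + '/something/zk?x=10&y=5'"

puts parses?(unquoted)
puts parses?(quoted)

# The same applies to the search URL: quote it, then append the page number.
INITIAL_URL = 'https://zoek.officielebekendmakingen.nl/zoeken/resultaat/?zkt=Uitgebreid&pst=ParlementaireDocumenten'
page_url = "#{INITIAL_URL}&_page=5"
puts page_url
```

The exact wording of the syntax error can differ between Ruby versions, so the sketch only checks whether parsing succeeds.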