Has anyone got RDig working on Linux?

ngoc · January 23, 2007, 3:55pm

This is what I got:

root@linux:~# rdig -c configfile
RDig version 0.3.4
using Ferret 0.10.14
added url file:///home/myaccount/documents/
waiting for threads to finish...
root@linux:~# rdig -c configfile -q "Ruby"
RDig version 0.3.4
using Ferret 0.10.14
executing query >Ruby<
Query:
total results: 0
root@linux:~#

Here is my config file. I renamed config to cfg in case I had mistyped it somewhere, and I set:
cfg.index.create = false

RDig.configuration do |cfg|

  ##################################################################
  # options you really should set

  # provide one or more URLs for the crawler to start from
  cfg.crawler.start_urls = [ 'http://www.example.com/' ]

  # use something like this for crawling a file system:
  cfg.crawler.start_urls = [ 'file:///home/myaccount/documents/' ]
  # beware, mixing file and http crawling is not possible and might
  # result in unpredictable results.

  # limit the crawl to these hosts. The crawler will never
  # follow any links pointing to hosts other than those given here.
  # ignored for file system crawling
  cfg.crawler.include_hosts = [ 'www.example.com' ]

  # this is the path where the index will be stored
  # caution, existing contents of this directory will be deleted!
  cfg.index.path = '/home/myaccount/index'

  ##################################################################
  # options you might want to set, the given values are the defaults

  # set to true to get stack traces on errors
  cfg.verbose = true

  # content extraction options
  cfg.content_extraction = OpenStruct.new(

    # HPRICOT configuration
    # this is the html parser used by default from RDig 0.3.3 upwards.
    # Hpricot by far outperforms Rubyful Soup, and is at least as flexible
    # when it comes to selection of portions of the html documents.
    :hpricot      => OpenStruct.new(
      # css selector for the element containing the page title
      :title_tag_selector => 'title',
      # might also be a proc returning either an element or a string:
      # :title_tag_selector => lambda { |hpricot_doc| ... }
      :content_tag_selector => 'body'
      # might also be a proc returning either an element or a string:
      # :content_tag_selector => lambda { |hpricot_doc| ... }
    )

    # RUBYFUL SOUP
    # This is a powerful, but somewhat slow, ruby-only html parsing lib
    # which was RDig's default html parser up to version 0.3.2. To use it,
    # comment the hpricot config above, and uncomment the following:
    #
    # :rubyful_soup => OpenStruct.new(
    #   # provide a method that returns the title of an html document
    #   # this method may either return a tag to extract the title from,
    #   # or a ready-to-index string.
    #   :content_tag_selector => lambda { |tagsoup|
    #     tagsoup.html.body
    #   },
    #   # provide a method that selects the tag containing the page content you
    #   # want to index. Useful to avoid indexing common elements like navigation
    #   # and page footers for every page.
    #   :title_tag_selector => lambda { |tagsoup|
    #     tagsoup.html.head.title
    #   }
    # )
  )

  # crawler options

  # Notice: for file system crawling the include/exclude_document
  # patterns are applied to the full path of files only (like
  # /home/bob/test.pdf), for http to full URIs (like http://www.example.com/).

  # nil (include all documents) or an array of Regexps
  # matching the URLs you want to index.
  cfg.crawler.include_documents = nil

  # nil (no documents excluded) or an array of Regexps
  # matching URLs not to index.
  # this filter is used after the one above, so you only need
  # to exclude documents here that aren't wanted but would be
  # included by the inclusion patterns.
  cfg.crawler.exclude_documents = nil

  # number of document fetching threads to use. Should be raised only if
  # your CPU has idle time when indexing.
  cfg.crawler.num_threads = 2
  # suggested setting for file system crawling:
  cfg.crawler.num_threads = 1

  # maximum number of http redirections to follow
  cfg.crawler.max_redirects = 5

  # number of seconds to wait with an empty url queue before
  # finishing the crawl. Set to a higher number when experiencing
  # incomplete crawls on slow sites. Don't set to 0, even when
  # crawling a local fs.
  cfg.crawler.wait_before_leave = 10

  # indexer options

  # create a new index on each run. Will append to the index if false.
  # Use when building a single index from multiple runs, e.g. one across
  # a website and the other a tree in a local file system
  cfg.index.create = false

  # rewrite document uris before indexing them. This is useful if you're
  # indexing on disk, but the documents should be accessible via http,
  # e.g. from a web based search application. By default, no rewriting
  # takes place.
  # example:
  # cfg.index.rewrite_uri = lambda { |uri|
  #   uri.path.gsub!(/^\/base\//, '/virtual_dir/')
  #   uri.scheme = 'http'
  #   uri.host = 'www.mydomain.com'
  # }

end
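
The include_documents / exclude_documents comments in the config describe a two-step filter: the inclusion patterns are applied first, the exclusion patterns second. A rough sketch of that order in plain Ruby, assuming nothing about how RDig implements it internally (the method name indexable? and the sample patterns are made up for illustration):

```ruby
# Hypothetical helper, not RDig code: mimics the filter order described
# in the config comments (include patterns first, then exclude patterns).
def indexable?(path, include_patterns, exclude_patterns)
  # nil include patterns means "include all documents"
  included = include_patterns.nil? || include_patterns.any? { |re| re =~ path }
  # nil exclude patterns means "no documents excluded"
  excluded = !exclude_patterns.nil? && exclude_patterns.any? { |re| re =~ path }
  included && !excluded
end

include_patterns = [ /\.html\z/, /\.pdf\z/ ]  # example values only
exclude_patterns = [ /\/drafts\// ]           # example values only

puts indexable?('/home/bob/test.pdf', include_patterns, exclude_patterns)
puts indexable?('/home/bob/drafts/test.pdf', include_patterns, exclude_patterns)
puts indexable?('/home/bob/notes.txt', nil, nil)
```

With both options left at nil, as in the config above, every path passes the filter.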

ngoc · January 23, 2007, 5:44pm

On Tue, Jan 23, 2007 at 03:55:06PM +0100, ngoc wrote:

> executing query >Ruby<
> Query:
> total results: 0
> root@linux:~#

Strange. I cut and pasted your config, changed only the start_urls and
the index location, and it worked like a charm. What is in the documents
directory: only files, or subdirectories as well? Any strange file names
(spaces and such)? There’s a known bug concerning spaces in file and
directory names; maybe that’s the problem?

Jens

–
webit! Gesellschaft für neue Medien mbH www.webit.de
Dipl.-Wirtschaftsingenieur Jens Krämer [email protected]
Schnorrstraße 76 Tel +49 351 46766 0
D-01069 Dresden Fax +49 351 46766 66

ngoc · January 23, 2007, 6:48pm

> and such)? There’s a known bug concerning spaces in file/directory
> names, maybe that’s the problem?

Hi Jens,
I had stored only one file in the directory, and its name contained a
space and had no extension. I renamed it to a name without spaces that
ends in .html, and now it works.
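
For anyone hitting the same problem, a small script can list the files that are likely to trip up the crawler. This is only a sketch under my own assumptions: the helper name suspicious_files is made up, it is not part of RDig, and it checks just the two things mentioned in this thread (whitespace in a name, and a file without an extension).

```ruby
require 'find'

# Hypothetical helper, not part of RDig: walks the tree below +root+ and
# returns [path, reasons] pairs for names that contain whitespace and
# for files that have no extension.
def suspicious_files(root)
  problems = []
  Find.find(root) do |path|
    name = File.basename(path)
    reasons = []
    # applies to both files and directories
    reasons << 'whitespace in name' if name =~ /\s/
    # only regular files need an extension
    reasons << 'no extension' if File.file?(path) && File.extname(name).empty?
    problems << [path, reasons] unless reasons.empty?
  end
  problems
end

# Usage, with the directory from the config above:
#   suspicious_files('/home/myaccount/documents').each do |path, reasons|
#     puts "#{path}: #{reasons.join(', ')}"
#   end
```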

I realise that I need to work with it more before putting it to use. It
is so Linux-oriented. Now I have to read the code line by line to learn
how it works inside. That will take a long time.

Thanks Jens

ngoc

ngoc · January 24, 2007, 10:09am

Hi!

On Tue, Jan 23, 2007 at 06:48:03PM +0100, ngoc wrote:

> > and such)? There’s a known bug concerning spaces in file/directory
> > names, maybe that’s the problem?
> Hi Jens,
> I had stored only one file in the directory, and its name contained a
> space and had no extension. I renamed it to a name without spaces that
> ends in .html, and now it works.

Ah, OK. The file name extension is needed because there is no other
(easy) way to tell which content extractor to use. On *nix systems the
‘file’ command might be of use here, but that would tie RDig even more
closely to Linux and friends…
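
To illustrate why a missing extension leaves nothing to go on, here is a minimal sketch of extension-based type guessing. It is not RDig's actual code; the table CONTENT_TYPES and the method guess_content_type are hypothetical names chosen for this example.

```ruby
# Hypothetical lookup table, not RDig's actual implementation: maps a
# file name extension to a content type an extractor could handle.
CONTENT_TYPES = {
  '.html' => 'text/html',
  '.htm'  => 'text/html',
  '.txt'  => 'text/plain',
  '.pdf'  => 'application/pdf'
}

# Returns the guessed content type, or nil when the extension is
# missing or unknown.
def guess_content_type(filename)
  CONTENT_TYPES[File.extname(filename).downcase]
end

p guess_content_type('my_document.html')  # prints "text/html"
p guess_content_type('my document')       # prints nil
```

A file without an extension yields nil here, so no extractor can be chosen for it.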

> I realise that I need to work with it more before putting it to use. It
> is so Linux-oriented. Now I have to read the code line by line to learn
> how it works inside. That will take a long time.

Sorry for the inconvenience, but I only rarely get to use anything other
than Linux. I’ll happily apply any fixes that make RDig work on Windows,
though. In any case, I’ll fix the problem with spaces in file names by
the end of the week.

cheers,
Jens
