Forum: Ferret - Has anyone gotten RDig to work on Linux?

Announcement (2017-05-07): www.ruby-forum.com is now read-only, since I unfortunately no longer have the time to support and maintain the forum. Please see rubyonrails.org/community and ruby-lang.org/en/community for other Rails- and Ruby-related community platforms.
ngoc (Guest)
on 2007-01-23 15:55
I got this:

root@linux:~# rdig -c configfile
RDig version 0.3.4
using Ferret 0.10.14
added url file:///home/myaccount/documents/
waiting for threads to finish...
root@linux:~# rdig -c configfile -q "Ruby"
RDig version 0.3.4
using Ferret 0.10.14
executing query >Ruby<
Query:
total results: 0
root@linux:~#



My configfile is below. I changed the variable name from config to cfg in case I had mistyped it, and I set cfg.index.create = false:

RDig.configuration do |cfg|

  ##################################################################
  # options you really should set

  # provide one or more URLs for the crawler to start from
  cfg.crawler.start_urls = [ 'http://www.example.com/' ]

  # use something like this for crawling a file system:
   cfg.crawler.start_urls = [ 'file:///home/myaccount/documents/' ]
  # beware, mixing file and http crawling is not possible and might
  # result in unpredictable results.

  # limit the crawl to these hosts. The crawler will never
  # follow any links pointing to hosts other than those given here.
  # ignored for file system crawling
  cfg.crawler.include_hosts = [ 'www.example.com' ]

  # this is the path where the index will be stored
  # caution, existing contents of this directory will be deleted!
  cfg.index.path        = '/home/myaccount/index'

  ##################################################################
  # options you might want to set, the given values are the defaults

  # set to true to get stack traces on errors
   cfg.verbose = true

  # content extraction options
  cfg.content_extraction = OpenStruct.new(

  # HPRICOT configuration
  # this is the html parser used by default from RDig 0.3.3 upwards.
  # Hpricot by far outperforms Rubyful Soup, and is at least as flexible
  # when it comes to selection of portions of the html documents.
    :hpricot      => OpenStruct.new(
      # css selector for the element containing the page title
      :title_tag_selector => 'title',
      # might also be a proc returning either an element or a string:
      # :title_tag_selector => lambda { |hpricot_doc| ... }
      :content_tag_selector => 'body'
      # might also be a proc returning either an element or a string:
      # :content_tag_selector => lambda { |hpricot_doc| ... }
    )

  # RUBYFUL SOUP
  # This is a powerful, but somewhat slow, ruby-only html parsing lib
  # which was RDig's default html parser up to version 0.3.2. To use it,
  # comment out the hpricot config above, and uncomment the following:
  #
  #  :rubyful_soup => OpenStruct.new(
  #    # provide a method that selects the tag containing the page
  #    # content you want to index. Useful to avoid indexing common
  #    # elements like navigation and page footers for every page.
  #    :content_tag_selector => lambda { |tagsoup|
  #      tagsoup.html.body
  #    },
  #    # provide a method that returns the title of an html document.
  #    # this method may either return a tag to extract the title from,
  #    # or a ready-to-index string.
  #    :title_tag_selector         => lambda { |tagsoup|
  #      tagsoup.html.head.title
  #    }
  #  )
  )

  # crawler options

  # Notice: for file system crawling the include/exclude_document
  # patterns are applied to the full path of _files_ only (like
  # /home/bob/test.pdf), for http to full URIs (like
  # http://example.com/index.html).

  # nil (include all documents) or an array of Regexps
  # matching the URLs you want to index.
   cfg.crawler.include_documents = nil

  # nil (no documents excluded) or an array of Regexps
  # matching URLs not to index.
  # this filter is used after the one above, so you only need
  # to exclude documents here that aren't wanted but would be
  # included by the inclusion patterns.
  # cfg.crawler.exclude_documents = nil

  # number of document fetching threads to use. Should be raised only if
  # your CPU has idle time when indexing.
  # cfg.crawler.num_threads = 2
  # suggested setting for file system crawling:
   cfg.crawler.num_threads = 1

  # maximum number of http redirections to follow
  # cfg.crawler.max_redirects = 5

  # number of seconds to wait with an empty url queue before
  # finishing the crawl. Set to a higher number when experiencing
  # incomplete crawls on slow sites. Don't set to 0, even when crawling
  # a local fs.
   cfg.crawler.wait_before_leave = 10

  # indexer options

  # create a new index on each run. Will append to the index if false.
  # Use when building a single index from multiple runs, e.g. one across
  # a website and the other across a tree in a local file system.
   cfg.index.create = false

  # rewrite document uris before indexing them. This is useful if you're
  # indexing on disk, but the documents should be accessible via http,
  # e.g. from a web based search application. By default, no rewriting
  # takes place.
  # example:
  # cfg.index.rewrite_uri = lambda { |uri|
  #   uri.path.gsub!(/^\/base\//, '/virtual_dir/')
  #   uri.scheme = 'http'
  #   uri.host = 'www.mydomain.com'
  # }

end
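For what it's worth, the rewrite_uri idea near the end of the config can be tried standalone with Ruby's URI library. This is only a sketch of the rewrite logic; the paths and host are the illustrative values from the sample config, not anything RDig requires:

```ruby
require 'uri'

# Standalone sketch of the uri rewriting shown in the sample config.
# Uses the setter form (uri.path =) instead of gsub! so it also works
# where the path string is frozen.
rewrite = lambda { |uri|
  uri.path   = uri.path.sub(%r{\A/base/}, '/virtual_dir/')
  uri.scheme = 'http'
  uri.host   = 'www.mydomain.com'
}

uri = URI.parse('file:///base/docs/readme.html')
rewrite.call(uri)
puts uri   # http://www.mydomain.com/virtual_dir/docs/readme.html
```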
Jens Kraemer (Guest)
on 2007-01-23 17:44
(Received via mailing list)
On Tue, Jan 23, 2007 at 03:55:06PM +0100, ngoc wrote:
> executing query >Ruby<
> Query:
> total results: 0
> root@linux:~#

Strange. I cut'n'pasted your config and only changed the start_urls and
the index location, and it worked like a charm. What is in the documents
directory - only files, or subdirectories, any strange file names
(spaces and such)? There's a known bug concerning spaces in
file/directory names, maybe that's the problem?

Jens

--
webit! Gesellschaft für neue Medien mbH          www.webit.de
Dipl.-Wirtschaftsingenieur Jens Krämer       kraemer@webit.de
Schnorrstraße 76                         Tel +49 351 46766  0
D-01069 Dresden                          Fax +49 351 46766 66
ngoc (Guest)
on 2007-01-23 18:48
> and such)? There's a known bug concerning spaces in file/directory
> names, maybe that's the problem?
Hi Jens
I stored only one file in the directory, and its name contained a space
and had no extension. I renamed it to a name without the space and added
an .html extension -> it works.

I realise I need to work more with it before putting it to use. It is
very Linux-oriented. Now I have to read it line by line to learn how it
works inside. That will take a long time.

Thanks Jens

ngoc
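Until the spaces bug mentioned above is fixed, one workaround is to normalize file names before indexing - replace the spaces and make sure each file has an extension. A rough sketch (not part of RDig; the directory is the example path from the config above):

```ruby
require 'fileutils'

# Pre-indexing cleanup sketch: replace spaces in file names with
# underscores so the crawler doesn't trip over them. The directory
# path is the example from the config above.
dir = '/home/myaccount/documents'

Dir.glob(File.join(dir, '**', '*')).each do |path|
  next unless File.file?(path) && File.basename(path).include?(' ')
  fixed = File.join(File.dirname(path), File.basename(path).tr(' ', '_'))
  FileUtils.mv(path, fixed)
end
```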
Jens Kraemer (Guest)
on 2007-01-24 10:09
(Received via mailing list)
Hi!

On Tue, Jan 23, 2007 at 06:48:03PM +0100, ngoc wrote:
> > and such)? There's a known bug concerning spaces in file/directory
> > names, maybe that's the problem?
> Hi Jens
> I stored only one file in the catalogue. And it has space in file name
> without ending. So I correct it with connected name and ending html ->
> It works.

Ah, ok. The filename extension is needed, since there is no other (easy)
way to tell which content extractor to use. On *nix systems the 'file'
command might be of use here, but that would tie RDig even more closely
to Linux and friends...

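To illustrate the point about extensions: a content-extractor lookup keyed on the file extension might look like the sketch below. The mapping is invented for the example and is not RDig's actual table:

```ruby
# Illustrative extension-based dispatch, similar in spirit to what RDig
# relies on -- this mapping is made up for the example, not RDig's own.
EXTRACTORS = {
  '.html' => :html, '.htm' => :html,
  '.txt'  => :text,
  '.pdf'  => :pdf
}.freeze

def extractor_for(path)
  EXTRACTORS[File.extname(path).downcase]   # nil means: skip the file
end

extractor_for('notes.HTML')  # => :html
extractor_for('report')      # => nil -- no extension, nothing to index
```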
> I recognise that I need to work more with it before taking in use. It is
> so linux oriented. Now I have to read line by line to learn more how it
> works inside. It will take long time.

Sorry for the inconvenience, but I rarely get to use anything other than
Linux - that said, I'll happily apply any fixes that make RDig work on
Windows. I'll fix the problem with spaces in filenames by the end of the
week.

cheers,
Jens

