Forum: Ruby character safe CSV parser.

(Guest)
on 2005-12-23 11:48
(Received via mailing list)
I was running into difficulties with the CSV library in Ruby. I had
some files that were exports from a FileMaker database, and they had
newline and vtab characters within strings. This seemed to cause
problems for the library. I ended up writing my own method that
parses a file character by character (not using readline). I know that
it might be better to use a regular expression, or to specify the row
delimiter in the readline method, but the method I made seems
fairly flexible for different types of characters. Please feel free to
use it or rip it apart. Any suggestions are welcome as well.

# linesafe_parse_csv by Sean W. ( sean at i heart squares dot com)
# (c) 2005 Sean W. (GPL license applies)
# Implementation of a CSV parser that is safe to use with
# strings that may contain newline or other special characters.
# Accepts arguments to specify the field, string, and row delimiters,
# along with an escape character and a set of characters to strip.
# A block must be passed to the method; it receives an array
# of strings for each row.
#
# Example:
#      Column_Names = [ :id, :first_name, :last_name, :email ]
#      table = {}
#      file = File.open("mycsv_file.csv", "r")
#      linesafe_parse_csv(file, ",", '"', "\r", "\\", "\v") do |csv_row|
#        table_row = {}
#        for index in 0...csv_row.length
#          table_row[Column_Names[index]] = csv_row[index]
#        end
#        table[table_row[:id]] = table_row
#      end
#      file.close
#      table
  def linesafe_parse_csv(file, cell_delim, string_delim, row_delim,
                         esc_delim, chars_to_elim)
    # in Ruby 1.8, reading a character from a file returns a Fixnum,
    # so take the first character of each delimiter for comparisons
    str_dim_i = string_delim[0]
    cell_dim_i = cell_delim[0]
    row_dim_i = row_delim[0]
    esc_dim_i = esc_delim[0]

    # loop until the end of file
    while !file.eof?
      row = []
      in_str = false
      in_esc = false
      newrow = false
      value = ""

      # loop through and parse a row.
      while !newrow && !file.eof?
        char = file.getc

        # handle what to do with the char
        if char == str_dim_i
          if !in_str
            in_str = true
          elsif !in_esc
            in_str = false
          else
            value << char
            in_esc = false
          end
        elsif char == esc_dim_i
          if !in_esc
            in_esc = true
          else
            value << char
            in_esc = false
          end
        elsif char == row_dim_i
          if !in_str
            # handle nil values
            if value == ''
              row << nil
            else
              # we strip any unwanted characters before
              # adding them to the row array
              row << value.tr(chars_to_elim, '').strip
              value = ''
            end
            newrow = true
          else
            value << char
          end
        elsif char == cell_dim_i
          if !in_str && !in_esc
            # handle nil values
            if value == ''
              row << nil
            else
              row << value.tr(chars_to_elim, '').strip
            end
            value = ''
          elsif in_esc
            value << char
            in_esc = false
          else
            value << char
          end
        else
          # an escape before an ordinary character has no effect
          value << char
          in_esc = false
        end
      end

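      # at end of file the last row may lack a row delimiter,
      # so keep any field that is still pending
      row << value.tr(chars_to_elim, '').strip unless newrow || value == ''

      # pass the row to the caller's block
      yield row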
    end
  end
James G. (Guest)
on 2005-12-23 19:08
(Received via mailing list)
On Dec 23, 2005, at 3:47 AM, removed_email_address@domain.invalid wrote:

> I was running into difficulties with the CSV library in Ruby. I had
> some files that were exports from a Filemaker database, and it had
> newline and vtab characters within strings.

In quoted or unquoted fields?  If it was quoted, I'm confident
FasterCSV[1] would parse it correctly.  If it's unquoted, it's
malformed CSV and all bets are off.  ;)

1:  http://rubyforge.org/projects/fastercsv/
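
For readers on a current Ruby, where the standard `csv` library descends from FasterCSV, the quoted case can be checked with a minimal sketch like the one below. The sample data is made up for illustration (it is not the FileMaker export from this thread), and the `"\r"` row separator is an assumption taken from the original post.

```ruby
require 'csv'

# Hypothetical sample data: rows end in "\r", and the first quoted
# field contains both a newline and a vertical tab.
data = "\"line one\nline two\vline three\",\"x\"\r\"plain\",\"y\"\r"

# row_sep tells the parser that "\r" ends a row; control characters
# inside the quotes stay part of the field.
rows = CSV.parse(data, row_sep: "\r")

rows.each { |row| p row }
```

The embedded newline and vertical tab come back as part of the first field rather than splitting the row.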

James Edward G. II
(Guest)
on 2005-12-23 20:23
(Received via mailing list)
Interesting. I didn't see this package when I was searching for info on
CSV and Ruby. The fields do use quoted text. The reason I looked into
my own method was that the CSV library in Ruby, and a lot of other
methods, would not work right if a string (inside quotes) had a newline
character (or other control characters for that matter).

I'll try out the FasterCSV, and see how it works for me.

Thanks.
James G. (Guest)
on 2005-12-23 20:29
(Received via mailing list)
On Dec 23, 2005, at 12:22 PM, removed_email_address@domain.invalid wrote:

> I'll try out the FasterCSV, and see how it works for me.

Great.  And if you run into problems, please let me know because it is
*supposed* to work...  ;)

James Edward G. II
(Guest)
on 2005-12-23 22:43
(Received via mailing list)
Hi, sorry if this seems a little daft, but in your documents you
frequently refer to a file called FasterCSV, which I imagine to be a
document of some kind on its dev use. None of the archives on
RubyForge seem to contain that file. Even faster_csv.rb
refers to this file.

Is it somewhere where I'm not looking?
James G. (Guest)
on 2005-12-23 22:55
(Received via mailing list)
On Dec 23, 2005, at 2:42 PM, removed_email_address@domain.invalid wrote:

> Hi, sorry if this seems a little daft, but in your documents you
> frequently refer to a file called FasterCSV, which I imagine to be a
> document of some kind on its dev use. None of the archives on
> RubyForge seem to contain that file. Even faster_csv.rb
> refers to this file.
>
> Is it somewhere where I'm not looking?

I'm not sure I understand the question.

FasterCSV is the primary interface class the library provides you.
The class is documented here:

http://fastercsv.rubyforge.org/classes/FasterCSV.html

If that didn't cover what you were asking, try me again.  I must just
be misunderstanding...

James Edward G. II
(Guest)
on 2005-12-23 23:28
(Received via mailing list)
Ahh there we go! I guess what I was trying to ask is what you just sent
me. But in the README and FasterCSV file I see this:

README (line 49 -51):
"== Documentation

See FasterCSV for documentation."

and this in faster_csv.rb (Line 8)
"# See FasterCSV for documentation. "

Because of that I was assuming that there should have been a file
called "FasterCSV" like the "README" file in the project package. I
didn't see any URL anywhere.

But regardless, I get the following error:
Unquoted fields do not allow \r or \n.

RAILS_ROOT: ./script/../config/..
Application Trace | Framework Trace | Full Trace

c:/dev/ruby/lib/ruby/gems/1.8/gems/fastercsv-0.1.4/lib/faster_csv.rb:408:in `shift'
#{RAILS_ROOT}/app/controllers/import_controller.rb:41:in `join'
#{RAILS_ROOT}/app/controllers/import_controller.rb:41:in `upload'


This is strange since all fields are quoted. It could be because of the
Vertical tab characters that are in the file.

I called the library like so....
        faster_csv = FasterCSV.new(file.read, :row_sep => "\r")
        faster_csv.each do |csv_row|
            table_row = {}
            for index in 0...csv_row.length
                table_row[columns[index]] = csv_row[index].sub("\v", "\n") if csv_row[index]
            end
            table[table_row[id_symbol]] = table_row
        end
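
One detail in the loop above: `String#sub` replaces only the first match, so a field holding several vertical tabs would keep all but the first. A minimal sketch of the same row-to-hash step using `gsub` follows. The column names and rows are hypothetical stand-ins, and `zip` is swapped in for the index loop purely as an illustration.

```ruby
# Hypothetical column names and already-parsed rows, standing in for
# the values used in the post above.
columns = [:id, :first_name, :notes]
id_symbol = :id
rows = [
  ["1", "Ann", "first\vsecond\vthird"],
  ["2", "Bob", nil]
]

table = {}
rows.each do |csv_row|
  table_row = {}
  # pair each column name with the value in the same position
  columns.zip(csv_row) do |name, value|
    # gsub replaces every vertical tab; sub would stop after the first
    table_row[name] = value.gsub("\v", "\n") if value
  end
  table[table_row[id_symbol]] = table_row
end

p table["1"][:notes]
```

Unlike the index loop, `zip` walks the column list rather than the row, so fields beyond the known columns are ignored instead of being stored under a `nil` key.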


The CSV file is an export from a user's FileMaker database that will
then get uploaded to this Rails app to be parsed into a MySQL database.
Here is a short excerpt from FileMaker about their CSV export format (I
hope this is of help):

> 6. No other control characters (<=$1F (less than or equal to decimal 31)) are generated
> during export, but embedded control characters are exported as themselves excepting as
> specified in #2 and #3 above.
>
> 7. Accented characters are exported as themselves without remapping from the platform's
> normal character set: å $8C (140) is exported as $8C (decimal 140) on the Mac - it is NOT
> remapped to $86 (decimal 134) which is the equivalent ASCII character.



Sean
James G. (Guest)
on 2005-12-23 23:43
(Received via mailing list)
On Dec 23, 2005, at 3:27 PM, removed_email_address@domain.invalid wrote:

> "# See FasterCSV for documentation. "
>
> Because of that I was assuming that there should have been a file
> called "FasterCSV" like the "README" file in the project package. I
> didn't see any URL anywhere.

RDoc makes sure it gets linked up correctly when it generates the
documentation for me.

> But regardless, I get the following error:
> Unquoted fields do not allow \r or \n.

We're probably boring the others with this discussion, so now is a
good time to take it off-list.  Please respond to this message
privately (removed_email_address@domain.invalid) and I will help you
resolve this.

I suspect it's a line-ending issue.  Is the file something you can
zip up and send to me?  I bet I can clear it up pretty quickly that
way...
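
A quick way to test the line-ending theory before sending a file around is to count each style of line ending in the raw bytes. This is only a sketch: `line_ending_counts` is a hypothetical helper, not part of FasterCSV, and the sample string stands in for real file contents (which could be read with `File.binread`).

```ruby
# Count each line-ending style in a string of raw file contents.
# Note: line breaks inside quoted fields are counted too, so the
# numbers are a hint rather than a row count.
def line_ending_counts(data)
  {
    crlf: data.scan(/\r\n/).length,       # Windows style
    cr:   data.scan(/\r(?!\n)/).length,   # bare CR (classic Mac style)
    lf:   data.scan(/(?<!\r)\n/).length   # bare LF (Unix style)
  }
end

# Made-up sample mixing all three styles.
sample = "a,b\r\nc,d\re,f\n"
counts = line_ending_counts(sample)
p counts
```

If one style clearly dominates, that is a reasonable first guess for `:row_sep`.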

James Edward G. II