Food Database (#159)

Note that because I am traveling tomorrow, I’ve posted this week’s
quiz a bit early.

The three rules of Ruby Q. 2:

  1. Please do not post any solutions or spoiler discussion for this
    quiz until 48 hours have passed from the time on this message.

  2. Support Ruby Q. 2 by submitting ideas as often as you can! (A
    permanent, new website is in the works for Ruby Q. 2. Until then,
    please visit the temporary website at

    http://matthew.moss.googlepages.com/home.)

  3. Enjoy!

Suggestion: A [QUIZ] in the subject of emails about the problem
helps everyone on Ruby T. follow the discussion. Please reply to
the original quiz message, if you can.

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

Quiz #159
Food Database

There are numerous themes we have encountered across all of the past
Ruby Q. problems, but there are a few that come back time and time
again, albeit sometimes in disguise. I can recall a number of quizzes
that were best, or most easily, approached using pattern matching. Data
searching is also a common theme, most often accessing the large,
well-known databases of vocabulary and numbers.

This week we’re going to explore another large database that you might
not be familiar with: the USDA’s Nutrient Database. You can find out
about this database at:
http://www.ars.usda.gov/services/docs.htm?docid=8964

The current database (SR20) can be downloaded from:
http://www.ars.usda.gov/Services/docs.htm?docid=15867

I recommend getting the abbreviated, ASCII download (a flat-file
database), though those who want to experience the full brunt of the
relational database are welcome to download that. I will focus on the
abbreviated version, since it will serve our needs for this and future
quizzes.

Opening the archive for the abbreviated database, you’ll find two files:

  • ABBREV.txt: this is the ASCII database
  • SR20_doc.pdf: a document describing the format and content of the
    abbreviated database.

(Note that SR20 now also contains a patch to the database. For the
purposes of this quiz, I am not concerned whether you apply that patch
or not. If you don’t want to worry about the patch, feel free to ignore
it.)

The format of the database is fairly simple; the provided document
explains the abbreviated file format beginning on page 29. To summarize,
each record is a single line and contains more than a few delimited
fields. Fields are separated by carets (^), and text fields are
surrounded by tildes (~). The file is sorted by the first field, the
food’s Nutrient Databank Number (NDB). Each line provides nutrient
information for 100 grams of that food.
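
As a rough illustration of the record format just described, here is a minimal sketch of splitting one line into fields. The method name parse_record and the sample line are made up for this example; the real column order should be taken from SR20_doc.pdf.

```ruby
# Split one ABBREV.txt-style line into its fields.
# Text fields arrive wrapped in tildes (~...~); anything else is treated
# as a number, and an empty field becomes nil.
def parse_record(line)
  line.chomp.split("^", -1).map do |field|
    if field.length >= 2 && field.start_with?("~") && field.end_with?("~")
      field[1..-2]            # strip the surrounding tildes
    elsif field.empty?
      nil                     # missing value
    else
      Float(field)            # raises ArgumentError on malformed numbers
    end
  end
end

# Illustrative line only, not copied from the database.
sample = "~01001~^~BUTTER,WITH SALT~^15.87^717^^0.85"
p parse_record(sample)
```

Passing -1 as the limit to split keeps trailing empty fields, so every record yields the same number of columns even when its last values are blank.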

Your task is to provide a function that will search this nutrient
database for a food and provide information about it.

def nutrient_report(food, weight=100)
    # print report to stdout
end

Parameter food will be a string that is the food to locate. Keep in
mind that there may be multiple entries that will simply match (a la
grep) the parameter provided. You should only report on one of these
foods at this time; which one to choose is up to you. You may want to
consider a metric such as the Levenshtein distance (see Wikipedia)
while comparing food names against the search string.
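
One possible way to pick a single food from several grep-style matches is sketched below. It is only an illustration: the method names levenshtein and best_match are invented here, the sample names are not guaranteed to match the database, and ranking by edit distance is just one of many reasonable metrics.

```ruby
# Classic dynamic-programming Levenshtein distance, keeping one row at a time.
def levenshtein(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Prefer names that contain the query (case-insensitively); if none do,
# fall back to every name. Either way, return the closest by edit distance.
def best_match(query, names)
  q = query.upcase
  candidates = names.select { |n| n.upcase.include?(q) }
  candidates = names if candidates.empty?
  candidates.min_by { |n| levenshtein(q, n.upcase) }
end

names = ["QUINOA,CKD", "QUINOA,UNCKD", "RICE,WHITE"]  # sample names only
puts best_match("quinoa", names)
puts best_match("qinoa", names)                        # typo still resolves
```

Comparing against all 7,500-plus names with this quadratic routine may be slow, so the substring filter doubles as a cheap way to shrink the candidate set first.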

Parameter weight is the weight to measure in grams, defaulting to 100.
(Recall that the nutrient information of each record of the database is
based upon 100 grams.) Your report should output numerical information
that corresponds to the weight requested. There is information in the
document provided that explains how to adjust for weight.
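
Since every record is based on 100 grams, the adjustment amounts to simple proportional scaling. A minimal sketch, assuming that interpretation (the helper name scale_for_weight is hypothetical, and the documentation remains the authority on any rounding rules):

```ruby
# Scale a per-100-gram value to the requested weight in grams.
# A nil input (missing value in the database) stays nil.
def scale_for_weight(per_100g, weight = 100)
  return nil if per_100g.nil?
  per_100g * weight / 100.0
end

puts scale_for_weight(8.0, 25)   # a quarter of the per-100 g amount
puts scale_for_weight(8.0)       # default weight leaves the value unchanged
```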

The output you provide is mostly up to you, but should include as a
minimum:

  • Full food name (as found in the database, not the search string)
  • Food weight (as provided to the function)
  • Nutrient values for:
    • Water
    • Protein
    • Carbohydrates (the Carbohydrt field)
    • Fats (sum of the fields FA_Sat, FA_Mono and FA_Poly)
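
To make the minimum output concrete, here is a sketch of formatting such a report from an already-parsed record. The hash keys (:name, :water, :protein, :carbohydrt, :fa_sat, :fa_mono, :fa_poly) are an assumed naming for this example; mapping them to the right columns is left to the parser, and the gram units should be confirmed against SR20_doc.pdf.

```ruby
# Build a plain-text report for one food. `record` holds per-100-gram values.
# Missing fat values are counted as 0.0 here, which is a simplification.
def format_report(record, weight = 100)
  factor = weight / 100.0
  fats = %i[fa_sat fa_mono fa_poly].sum { |key| record[key] || 0.0 }
  rows = {
    "Water"         => record[:water],
    "Protein"       => record[:protein],
    "Carbohydrates" => record[:carbohydrt],
    "Fats"          => fats
  }
  lines = ["#{record[:name]} (#{weight} g)"]
  rows.each do |label, per_100g|
    lines << format("  %-14s %8.2f g", label, (per_100g || 0.0) * factor)
  end
  lines.join("\n")
end

# Made-up values, purely to show the layout.
food = { name: "TEST FOOD", water: 50.0, protein: 10.0, carbohydrt: 20.0,
         fa_sat: 1.0, fa_mono: 2.0, fa_poly: 3.0 }
puts format_report(food, 200)
```

Returning a string rather than printing directly keeps the formatting testable; nutrient_report itself would then just puts the result.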

A few more things to consider. First, the database contains information
for over 7,500 food items. That may be a lot to search and do string
comparisons on. If you find your searches going very slowly, consider
caching the data to a more search-efficient format.
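
One simple form of caching is to parse the text file once and store the parsed structure with Marshal. The sketch below assumes that approach; the method name load_with_cache and the ".cache" suffix are invented for the example, and other formats (a database, an index file) would serve equally well.

```ruby
# Return parsed data for source_path, re-parsing only when the cache file
# is missing or older than the source. The block does the actual parsing.
# Only load Marshal files you wrote yourself; Marshal.load is not safe on
# untrusted input.
def load_with_cache(source_path, cache_path = "#{source_path}.cache")
  if File.exist?(cache_path) && File.mtime(cache_path) >= File.mtime(source_path)
    return File.open(cache_path, "rb") { |f| Marshal.load(f) }
  end
  data = yield(source_path)
  File.open(cache_path, "wb") { |f| Marshal.dump(data, f) }
  data
end

# Usage sketch (requires ABBREV.txt in the current directory):
#   records = load_with_cache("ABBREV.txt") do |path|
#     File.foreach(path).map { |line| line.chomp.split("^", -1) }
#   end
```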

Second, consider writing some tests with database integrity in mind. For
example, at a quick glance, it appears that all the food names are
presented in the database in full-caps. But if you base your search on
this assumption, you may miss at least one food (or perhaps more) in
your search, as at least one food was entered into ABBREV.txt in
mixed-case. There may be other errors in the file, so consider doing a
few sanity checks on the data file before diving into the heart of the
quiz. (Feel free to post integrity test code to the mailing list before
the waiting period is up.)
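
A few such sanity checks might look like the sketch below. The method name integrity_problems is made up, and the three checks (consistent field count, upper-case names, NDB numbers in ascending order) are only examples; the sketch also assumes the description is the second field and that NDB numbers are zero-padded so they compare correctly as strings.

```ruby
# Return a list of human-readable problems found in the given lines.
def integrity_problems(lines)
  problems = []
  expected = nil
  previous_ndb = nil
  lines.each_with_index do |line, idx|
    fields = line.chomp.split("^", -1)
    expected ||= fields.length   # first line sets the expected field count
    if fields.length != expected
      problems << "line #{idx + 1}: #{fields.length} fields, expected #{expected}"
    end
    ndb  = fields[0].to_s.delete("~")
    name = fields[1].to_s.delete("~")
    problems << "line #{idx + 1}: name not upper-case: #{name}" if name != name.upcase
    problems << "line #{idx + 1}: NDB number out of order" if previous_ndb && ndb < previous_ndb
    previous_ndb = ndb
  end
  problems
end

# Usage sketch: puts integrity_problems(File.foreach("ABBREV.txt"))
```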

Third, and finally, part of the goal here is to make available another
large, interesting database for future Ruby Q. problems. There are
plenty of opportunities available here… meal planning is just one
example. Keep this in mind while designing your solution: we want a firm
foundation for searching this nutrient database so that future problems
can focus on examining the results of the search.

On Mar 13, 11:35 pm, Matthew M. wrote:

(Note that SR20 now also contains a patch to the database. For the purposes of
this quiz, I am not concerned whether you apply that patch or not. If you
don’t want to worry about the patch, feel free to ignore it.)

From what I read, the patch that’s available would update version 19
to version 20. If you download the current release from the site -
SR20 - you won’t need the patch.

Actually, if you look at the patch, it’s only two food items, and
there would have been much more difference between SR19 and SR20 than
that.

The original published SR20 had a couple of errors in it; these errors
were found and corrected in Feb 2008, and that is what is in the
patch. Whether the updated SR20 zip file’s ABBREV.txt has that patch
applied… I don’t think so. I did a really, really quick search on
the first item and I think it differs between the ABBREV.txt and the
patch files. (Why they did it that way, I don’t know.)

On Mar 14, 6:26 am, Matthew M. wrote:

The original published SR20 had a couple of errors in it; these errors
were found and corrected in Feb 2008, and that is what is in the
patch. Whether the updated SR20 zip file’s ABBREV.txt has that patch
applied… I don’t think so. I did a really, really quick search on
the first item and I think it differs between the ABBREV.txt and the
patch files. (Why they did it that way, I don’t know.)

I see now. Very odd since the website mentions using it to update from
Release 19. Sorry for any confusion.

I’ve combined 3 of my own libraries, which resulted in a 30-minute
solution for this quiz.

The 3 parts are:

  • Hash#nearest instead of Hash#[]
  • Caching the result of an already parsed data file
  • Indexing of a data file of lines

Combine these parts into one sentence: “How can we find the
best matching key in a cached index of a data file?” Which was
pretty much the task of the quiz in one sentence…

I’ll explain my solution bottom-up, which roughly means: the
opposite of the order in which the code executes.

By the way, here are the timings for searching for the food
“QUINOA,CKD”. All runs produce the same output.

"quinoa,ckd"   Exact key, not cached            0m1.078s
"quinoa,ckd"   Exact key, cached                0m0.081s
"quinoa"       Partial key, cached              0m0.159s
"qinoa"        Partial key with typo, cached    0m0.179s

The Code + Explanation:
http://dark-code.bulix.org/srih9o-65866

gegroet,
Erik V. - http://www.erikveen.dds.nl/

The original published SR20 had a couple of errors in it; these errors
were found and corrected in Feb 2008, and that is what is in the
patch. Whether the updated SR20 zip file’s ABBREV.txt has that patch
applied… I don’t think so. I did a really, really quick search on
the first item and I think it differs between the ABBREV.txt and the
patch files. (Why they did it that way, I don’t know.)

I see now. Very odd since the website mentions using it to update from
Release 19. Sorry for any confusion.

Well, to be sure, I’m not 100% sure of my claim. That was my
interpretation, but I haven’t confirmed it.

On Thu, Mar 20, 2008 at 7:04 AM, Matthew M. wrote:

Just curious… just trying to get a feel for what quizzes will work
and what won’t.

Hi Matthew,

Personally, I never got to the details of the quiz. The intro to the
quiz told me that this isn’t that interesting:

"Data searching is also a common theme [in the Ruby quizzes], most often
accessing the large, well-known databases of vocabulary and numbers.

This week we’re going to explore another large database"

The summary for this quiz will be posted tomorrow morning (Friday).

I wonder… Seeing as how there was little discussion and only one
solution submitted, did y’all think this was:

  1. Too difficult?
  2. Too time consuming?
  3. Not interesting?
  4. Other?

Just curious… just trying to get a feel for what quizzes will work
and what won’t.

On Thu, Mar 20, 2008 at 1:04 PM, Matthew M. wrote:

The summary for this quiz will be posted tomorrow morning (Friday).

I wonder… Seeing as how there was little discussion and only one
solution submitted, did y’all think this was:

  1. Too difficult?
  2. Too time consuming?
  3. Not interesting?
  4. Other?

1: Maybe.
2: I would say yes.
3: Yes, partially because we have already had the Levenshtein distance
   in one of the v1 quizzes.
4: Bad luck; it happened to James too once or twice.

But I guess the killer was that it was too long (2). For what my
tupence (as this word is now defined on the list ;) is worth, maybe
you could have had a split quiz: one week about string comparison and
the next week about data caching, where everybody can use the
solutions of the previous week.

One could even dream about a series of quizzes that can magically be
put together into one application in a final quiz of the series.

Cheers
Robert


http://ruby-smalltalk.blogspot.com/


Whereof one cannot speak, thereof one must be silent.
Ludwig Wittgenstein

  1. Too difficult?
  2. Too time consuming?
  3. Not interesting?
  4. Other?

I have to admit that I had difficulty understanding what the quiz
description was aiming at when I read it for the first time. I
understood the use of the Levenshtein distance to be optional – but I
don’t quite see the benefit of it because it is likely to yield
erratic results. Since searches etc. should IMHO be case-insensitive,
I didn’t get caveat #2 (integrity checks, which probably would make
sense in the context of a relational database but IMHO not so much
when using only ABBREV.txt). Too many question marks at first sight.
And a lack of time to think it over once more (or ask for a
clarification). So maybe #2 and #4?

On Mar 20, 2008, at 7:04 AM, Matthew M. wrote:

I wonder… Seeing as how there was little discussion and only one
solution submitted, did y’all think this was:

  1. Too difficult?
  2. Too time consuming?
  3. Not interesting?
  4. Other?

Just curious… just trying to get a feel for what quizzes will work
and what won’t.

I looked at solving the quiz.

At first I thought I would just build a name-to-position-in-file index
and then binary search that to find matches. I think this makes for a
smooth solution to the current quiz, but it doesn’t really address
future quizzes that may want to query on different fields.
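
For readers curious what that first idea could look like, here is a rough sketch of a name-to-file-offset index searched with binary search. It is not James’s code; the method names and sample records are invented, it assumes the description is the second field, and it only finds prefix matches, not arbitrary grep-style substrings.

```ruby
require "stringio"

# Build a sorted array of [upper-cased name, byte offset of the line].
def build_name_index(io)
  index = []
  io.rewind
  until io.eof?
    offset = io.pos
    line = io.readline
    name = line.split("^", 3)[1].to_s.delete("~").upcase
    index << [name, offset]
  end
  index.sort
end

# Binary-search the index for the first name starting with `prefix`,
# then seek to that offset and return the raw record line (or nil).
def lookup(io, index, prefix)
  key = prefix.upcase
  i = index.bsearch_index { |name, _offset| name >= key }
  return nil unless i && index[i][0].start_with?(key)
  io.seek(index[i][1])
  io.readline.chomp
end

# Made-up records; with the real file, pass File.open("ABBREV.txt") instead.
data = StringIO.new("~01001~^~BUTTER,WITH SALT~^1.0\n~20035~^~QUINOA,CKD~^2.0\n")
index = build_name_index(data)
puts lookup(data, index, "quinoa")
```

As the message notes, an index keyed on the name alone does not help later queries on other fields, which is the limitation that pushed the idea toward a relational database.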

When I went back to the drawing board to account for that, my best
idea shifted to: stick the data in a relational database and query
that. Of course, this isn’t original (the database page tells you to
do this) and it doesn’t really show off any Ruby (save how to connect
to a database).

Given that, I was holding out for a better idea, but I never had one.
In short, I felt too dumb to play along this week. :)

James Edward G. II

A few things, folks…

Because I’m rather busy at the moment doing all the final packing and
tasks necessary to do my move next week, I’m a little behind on quiz
things. Also, I will be away from Internet stuffs for a bit.

So, there will be no new quiz for two weeks. Apologies, but that’s
about when I expect I’ll be able to get back to this.

Also, the summary I promised this morning isn’t ready yet, but I will
have it done this weekend. Unless…

Reading all your comments, I believe I could re-present the quiz in a
couple parts that would be less confusing and more approachable. But
it doesn’t make much sense to do that and summarize this week.

I do want to make this database usable for future quizzes, hence the
first need to have it searchable. I’m okay to summarize Erik’s
submission, but if y’all want to have another crack at a revised
version of this quiz, that’s cool too.

Comments?

Okay, given the response to the quiz, and that I need to finish
packing and move next week, I am going to revisit and rewrite this
quiz later. Thanks to everyone who provided valuable comments, and to
Erik, who provided a solution.

As a reminder, I will be mostly unavailable for two weeks; Ruby Q.
will continue April 4th.

On Mar 21, 2008, at 7:32 AM, Matthew M. wrote:

Reading all your comments, I believe I could re-present the quiz in a
couple parts that would be less confusing and more approachable. But
it doesn’t make much sense to do that and summarize this week.

I do want to make this database usable for future quizzes, hence the
first need to have it searchable. I’m okay to summarize Erik’s
submission, but if y’all want to have another crack at a revised
version of this quiz, that’s cool too.

Comments?

I’m interested in seeing the revision. That’s my two cents.

James Edward G. II

BTW, I recently came across this site: http://nutridb.org

Can’t remember where it was posted. According to the FAQ, it’s based
on the SR-19 database.