Metadata 0.3

On 9/15/07, Konrad M. wrote:

Any ideas?

Yeah, I failed at using git. Jeez. Sorry about that.
Here’s 0.3, it oughta work:

tarball: http://dark.fhtr.org/repos/metadata/metadata-0.3.tar.gz
git: http://dark.fhtr.org/repos/metadata

On 9/15/07, darren kirby wrote:

Hi Ilmari!

Just wanted to mention that despite the name, wmainfo will parse anything
wrapped in an ASF audio/video container format[0], so, you could use it to
parse wmv movies as well if your user didn’t have mplayer installed.

[0] Advanced Systems Format - Wikipedia

Thanks for the pointer!
I made it merge the wmainfo output to the mplayer output for wmv and
asf.

Description

This package `Metadata' comes with a library called `metadata' and
a small program called `mdh'.

The library probes files for their metadata (e.g. jpeg dimensions
and camera make, mp3 artist, pdf word count) and returns the metadata
as a Hash.

Mdh can print out file metadata as YAML and package the metadata
with the file.

This package has many dependencies since there is no single universal
metadata header format that all files use. Blame resource forks, filename
extensions, bags of bytes and mimetypes.

Usage

print out metadata header

mdh -p myfile.jpg

create myfile.jpg.mdh, which consists of metadata header + myfile.jpg

mdh myfile.jpg

print out metadata header from mdh file

mdh -e -p myfile.jpg.mdh

strip out metadata header from mdh file and save it to myfile.jpg

mdh -e myfile.jpg.mdh

irb> Metadata.extract('myfile.jpg')
irb> Metadata.extract_text('myfile.jpg')
irb> Pathname.new("myfile.jpg").metadata
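
Since the library returns the metadata as a plain Hash, printing it as YAML needs nothing beyond Ruby's standard library. A minimal sketch, not taken from the package itself: the sample Hash is written by hand with values copied from mdh output posted in this thread, and the use of String keys is an assumption about what Metadata.extract returns.

```ruby
require 'yaml'

# Hand-written stand-in for the Hash that Metadata.extract(filename)
# would return. String keys are an assumption, not confirmed by the README.
sample = {
  'File.Format'      => 'video/x-theora+ogg',
  'File.Size'        => 4618665,
  'Audio.Samplerate' => 44100,
  'Audio.Bitrate'    => 192.0
}

# Hash#to_yaml comes from the yaml standard library.
yaml_text = sample.to_yaml
puts yaml_text
```

Loading the text back with YAML.load should give an equal Hash, which makes for an easy sanity check.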

List of supported formats

Audio:
Successfully tested with:
mp3, flac, ogg, wav
Should also work:
wma, m4a

Video:
What you manage to make mplayer play, which can be just about
anything.
Then again, missing title and author data, etc. (do videos even have
those?)
Successfully tested with:
wmv, mov, divx, xvid, flv, ogm, mpg

Images:
Should handle pretty much anything (apart from XCF and ORF.)
Successfully tested with:
jpeg, png, gif, nef, dng, crw, pef, psd

Documents:
Successfully tested with:
pdf, ppt, odp, sxi, ps, ps.gz, html, txt
Should work:
- OpenOffice docs work to some degree (personally, I’m using unoconv to
convert OO docs to temp PDFs for the text & dimensions extraction, so
those bits of data are missing.)
- MS Office docs to some degree (ppt at least, doc and xls should work too,
dimensions missing due to the above temp PDF thing.)

Others:
Whatever extract spits out on the five or six bits of metadata I’m using
from it. Archive contents at least.

Requirements

  • Ruby 1.8

  • Tons of metadata extraction programs and libs,
    list of gems:
    flacinfo-rb
    wmainfo-rb
    MP4info
    list of debian packages:
    dcraw
    libimlib2-ruby
    extract
    libimage-exiftool-perl
    poppler-utils
    mplayer
    html2text
    imagemagick
    unhtml
    pstotext
    antiword
    catdoc
    shared-mime-info
    vorbis-tools

  • You do want to install the latest versions of dcraw and
    shared-mime-info to be able to handle camera raw images.
    http://cybercom.net/~dcoffin/dcraw/
    shared-mime-info

  • Python + chardet library
    http://chardet.feedparser.org/

Install

De-compress archive and enter its top directory.
Then type:

($ su)
# ruby setup.rb

This simple step installs this program under the default
location of Ruby libraries. You can also install files into
your favorite directory by supplying setup.rb some options.
Try “ruby setup.rb --help”.

License

Ruby’s


Ilmari H. <ilmari.heikkinen gmail com>
http://fhtr.blogspot.com

Er, I’m still not getting information out of ogg files:

$ mdh -p ~/music/bowling_for_soup_-_1985.ogg

Video.Duration: 192.78
Audio.Samplerate: 44100
Audio.Bitrate: 192.0
Image.DimensionUnit: px
Video.Codec: ""
File.Size: 4618665
Audio.Codec: vrbs
File.Modified: 2007-01-03T22:10:11-08:00
File.Format: video/x-theora+ogg

$ mplayer ~/music/bowling_for_soup_-_1985.ogg

Clip info:
Genre: Pop
Name: 1985
Artist: Bowling for Soup
Creation Date: 2004
Album: A Hangover You Don’t Deserve
Track: 03

Thanks for your quick responses!


Any chance you could wrap this up as a gem? It’s not something I care
strongly about, and I don’t know how complicated the process is, but I think
it would help ease installation for some users.

Quoth Ilmari H.:

Track: 03

Oh, nice, mplayer does give out metadata fields. I better augment
the mplayer info parser to grab those :)

0.5 here we come!

Also:
For mp3 id3v2 tags, the binary string “\xCB\x99\xC5\xA3” is being inserted
at the front of all the string fields.

$ mdh -p ~/music/Snoop\ Dogg\ -\ Gin\ &\ Juice.mp3

Audio.Album: "\xCB\x99\xC5\xA3Death Row's Snoop Doggy Dogg Greatest Hits (2001)"
Audio.Genre: "\xCB\x99\xC5\xA3Hip-Hop"
Audio.Title: "\xCB\x99\xC5\xA3Gin & Juice"
Audio.Artist: "\xCB\x99\xC5\xA3Snoop Dogg"

I think this is an id3v2 thing. Also, it happens in more than one file and
amaroK sees the tags “correctly”, so I’m thinking it’s on the metadata’s
end. Thanks!
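
One possible explanation for that prefix, offered as a guess and not confirmed anywhere in this thread: ID3v2 text frames may be stored as UTF-16 with a byte order mark, and the two bytes of a little-endian BOM (FF FE), if misread as ISO-8859-2 and re-encoded as UTF-8, come out as exactly CB 99 C5 A3. The sketch below checks that arithmetic and shows a hypothetical helper, decode_utf16_field, that consumes the BOM instead of leaking it. It uses String#b and the encoding API of Ruby 2.0 or later, not Ruby 1.8.

```ruby
# The prefix reported above, as raw bytes.
reported_prefix = "\xCB\x99\xC5\xA3".b

# A UTF-16LE byte order mark, deliberately misread as ISO-8859-2
# and then converted to UTF-8.
bom_misread = "\xFF\xFE".b.force_encoding('ISO-8859-2').encode('UTF-8')
puts bom_misread.b == reported_prefix

# Hypothetical helper (not part of the metadata library): decode a
# UTF-16 text field, using the BOM to pick the byte order and dropping it.
def decode_utf16_field(bytes)
  bytes = bytes.b
  if bytes.start_with?("\xFF\xFE".b)
    bytes[2..-1].force_encoding('UTF-16LE').encode('UTF-8')
  elsif bytes.start_with?("\xFE\xFF".b)
    bytes[2..-1].force_encoding('UTF-16BE').encode('UTF-8')
  else
    raise ArgumentError, 'no UTF-16 byte order mark found'
  end
end

puts decode_utf16_field("\xFF\xFEG\x00i\x00n\x00".b)
```

If the guess is right, the fix belongs wherever the tag bytes get their encoding detected, before any conversion to UTF-8.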


Another bug (Sorry :D):
$ mdh -p ~/music/Limp\ Bizkit\ -\ Rollin’\ (edited).ogg
sh: -c: line 0: syntax error near unexpected token `('
sh: -c: line 0: `ogginfo '/home/konrad/music/Limp Bizkit - Rollin'
(edited).ogg''

(Last line was broken up to email length.) You’re already escaping single
quotes for the shell; you need to escape opening and closing parens as well.
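
For what it’s worth, escaping characters one at a time tends to miss cases like this. A minimal sketch of two alternatives, assuming a newer Ruby than the 1.8 this package targets (Shellwords.escape arrived in 1.8.7, the array form of IO.popen in 1.9). The helper names probe_command and run_probe are made up for illustration and are not from the metadata library.

```ruby
require 'shellwords'

# Hypothetical helper: build a command line that is safe to hand to sh.
# Shellwords.escape backslash-escapes spaces, quotes, parens and so on.
def probe_command(program, filename)
  "#{program} #{Shellwords.escape(filename)}"
end

# Hypothetical helper: avoid the shell entirely by passing an argument
# array, so no escaping is needed at all.
def run_probe(program, filename)
  IO.popen([program, filename], &:read)
end

name = "/home/konrad/music/Limp Bizkit - Rollin' (edited).ogg"
puts probe_command('ogginfo', name)
```

The array form is the more robust of the two, since the filename never passes through a shell parser.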

Thanks,

On 9/15/07, Konrad M. wrote:

Audio.Codec: vrbs
File.Modified: 2007-01-03T22:10:11-08:00

File.Format: video/x-theora+ogg

^- That’s the problem there. It thinks it’s a video file.

Why? Probably because I hacked the mimetype guesser to _not_ assume things based on the filename extension, and the shared-mime-info db assumes that the guesser _is_ assuming things based on the filename extension.

Which is something I’d rather not do with downloaded files (which, by
their very nature, have wild disparities between the extension and the
real mimetype.) And the header content-type is often totally wrong or
doesn’t match shared-mime-info’s naming (e.g.
application/octet-stream vs. image/gif, audio/x-mp3 vs. audio/mpeg,
video/divx vs. video/x-msvideo, video/x-ms-asf vs. video/vnd.ms-asf…)

And this magic-over-extension sometimes leads to me getting generic
lesser-magic guesses instead of more specific filename extension
guesses (e.g. zip instead of OO document.) So, I have a list of
generic formats that defer to the extension rather than rely on
the lesser-magic.

Anyhow, it’s ugly, hacky magic.
Just like the rest of mimetype guessing.
</technical blather>
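
Roughly, the decision described above could be sketched like this. It is an illustration only, not the code in metadata.rb: the method name pick_mimetype and the contents of GENERIC_TYPES are assumptions.

```ruby
# Assumed list of "lesser-magic" results that are too generic to trust
# over the filename extension (zip is the example mentioned above).
GENERIC_TYPES = ['application/zip', 'application/octet-stream'].freeze

# Prefer the content-based (magic) guess, unless it is missing or generic,
# in which case defer to the extension-based guess when there is one.
def pick_mimetype(magic_guess, extension_guess)
  return magic_guess if extension_guess.nil?
  return extension_guess if magic_guess.nil? || GENERIC_TYPES.include?(magic_guess)
  magic_guess
end

# An OpenOffice document looks like a zip archive to the magic matcher.
puts pick_mimetype('application/zip', 'application/vnd.oasis.opendocument.text')
# A GIF saved with a misleading extension keeps its magic-based type.
puts pick_mimetype('image/gif', 'text/plain')
```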

But! Fixing this instance of the problem in the next thirty seconds.
… There!

And now, adding ogginfo metadata to video/x-theora+ogg.

Ok, try this:

http://dark.fhtr.org/repos/metadata/metadata-0.4.tar.gz

Thanks for your quick responses!

Thanks for the bug reports! They really help in making this thing
more robust.

On 9/15/07, Konrad M. wrote:

$ mplayer ~/music/bowling_for_soup_-_1985.ogg

Clip info:
Genre: Pop
Name: 1985
Artist: Bowling for Soup
Creation Date: 2004
Album: A Hangover You Don’t Deserve
Track: 03

Oh, nice, mplayer does give out metadata fields. I better augment
the mplayer info parser to grab those :)

0.5 here we come!
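
A rough idea of what grabbing those fields could look like, working only from the mplayer output pasted above. The method name parse_clip_info is made up, and this is a sketch under that one sample, not the parser that will ship in 0.5.

```ruby
# Collect the "Key: value" lines that follow "Clip info:" into a Hash.
# Parsing stops at the first line without a colon (e.g. a blank line).
def parse_clip_info(output)
  info = {}
  in_clip_info = false
  output.each_line do |line|
    line = line.strip
    if line == 'Clip info:'
      in_clip_info = true
    elsif in_clip_info
      key, value = line.split(':', 2)
      break if value.nil?
      info[key.strip] = value.strip
    end
  end
  info
end

sample = <<OUTPUT
Clip info:
Genre: Pop
Name: 1985
Artist: Bowling for Soup
Creation Date: 2004
Album: A Hangover You Don't Deserve
Track: 03

OUTPUT

clip_info = parse_clip_info(sample)
p clip_info
```

Values are kept as Strings, so the track comes back as "03" and any numeric conversion is left to the caller.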