[Cialug] a bit OT: working with big data
Brett Neese
brneese at brneese.com
Wed Nov 14 20:00:50 CST 2012
Hey all,
Does anyone in this group have experience working with Hive/Hadoop
MapReduce frameworks?
I'm currently working with this how-to
<http://aws.amazon.com/articles/5249664154115844> and the associated
data set, and I'm having some trouble figuring out how to structure my
queries.
My needs diverge at the "Word Ratio By Decade" step, namely: a) rather
than bucketing by decade when I create my new table, I'd prefer to
bucket by every four years, and b) instead of using the entire corpus,
I'd like to create a table limited to the names of the states (for
instance "Iowa" and "New York"). The current query is:
INSERT OVERWRITE TABLE by_decade
SELECT
    a.gram,
    b.decade,
    sum(a.occurrences) / b.total
FROM
    normalized a
JOIN (
    SELECT
        substr(year, 0, 3) as decade,
        sum(occurrences) as total
    FROM
        normalized
    GROUP BY
        substr(year, 0, 3)
) b
ON
    substr(a.year, 0, 3) = b.decade
GROUP BY
    a.gram,
    b.decade,
    b.total;
I have very limited SQL skills and no idea how to go about adapting
that. I do know I'll need to hand-code a WHERE clause to limit the
results to the state names.
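For reference, here is a sketch of the direction I'm imagining, based on the query above. It swaps the substr() decade trick for integer four-year buckets and adds the WHERE clause; the by_period table name and the state list are placeholders I made up, and I haven't been able to test it:

```sql
-- Sketch only (untested): four-year buckets instead of decades, restricted
-- to state-name grams. Assumes the "normalized" table from the how-to,
-- with columns gram, year, occurrences, and that by_period has been
-- created with the same layout as by_decade.
INSERT OVERWRITE TABLE by_period
SELECT
    a.gram,
    b.period,
    sum(a.occurrences) / b.total
FROM
    normalized a
JOIN (
    SELECT
        floor(year / 4) * 4 AS period,   -- 1900, 1904, 1908, ...
        sum(occurrences) AS total
    FROM
        normalized
    GROUP BY
        floor(year / 4) * 4
) b
ON
    floor(a.year / 4) * 4 = b.period
WHERE
    a.gram IN ('iowa', 'new york')       -- placeholder list; casing depends
                                         -- on how the grams were normalized
GROUP BY
    a.gram,
    b.period,
    b.total;
```

No idea whether that's the right approach, so corrections welcome.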
I would then like to take that raw data and export it out of Hadoop for
further manipulation.
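For the export step, my understanding is that Hive can write query results straight to a local directory, something like the following (the path is just a placeholder, and the output is plain text with Hive's default field delimiter):

```sql
-- Sketch: dump the aggregated table to a local path for further analysis.
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/by_decade_export'
SELECT * FROM by_decade;
```

If there's a better way to pull data out of Hadoop, I'm all ears.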
Can anyone help me here? I greatly appreciate it.
--
Brett Neese
563-210-3459
http://brneese.com