[Cialug] a bit OT: working with big data
Brett Neese
brneese at brneese.com
Wed Nov 14 20:00:50 CST 2012
Hey all,
Does anyone in this group have experience working with Hive/Hadoop
MapReduce frameworks?
I'm currently working with this how-to
<http://aws.amazon.com/articles/5249664154115844> and the associated
data set, and I'm having some trouble figuring out how to structure my
queries.
My needs diverge at the "Word Ratio By Decade" step, namely: a) rather
than bucketing by decade when I create my new table, I'd prefer to
bucket by every four years, and b) instead of using the entire corpus,
I'd like to create a table limited to the names of the states (for
instance "Iowa" and "New York"). The current query is:
INSERT OVERWRITE TABLE by_decade
SELECT
    a.gram,
    b.decade,
    sum(a.occurrences) / b.total
FROM
    normalized a
JOIN (
    SELECT
        substr(year, 0, 3) as decade,
        sum(occurrences) as total
    FROM
        normalized
    GROUP BY
        substr(year, 0, 3)
) b
ON
    substr(a.year, 0, 3) = b.decade
GROUP BY
    a.gram,
    b.decade,
    b.total;
I have very limited SQL skills and no idea how to go about adapting
that. I do know I'll need to hand-code a WHERE clause to limit the
results to the state names.
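For reference, here is a sketch of the direction I'm imagining, based on the query above. It swaps the substr() decade trick for integer four-year buckets and adds the WHERE clause; the by_period table name and the state list are placeholders I made up, and I haven't been able to test it:

```sql
-- Sketch only (untested): four-year buckets instead of decades, restricted
-- to state-name grams. Assumes the "normalized" table from the how-to,
-- with columns gram, year, occurrences, and that by_period has been
-- created with the same layout as by_decade.
INSERT OVERWRITE TABLE by_period
SELECT
    a.gram,
    b.period,
    sum(a.occurrences) / b.total
FROM
    normalized a
JOIN (
    SELECT
        floor(year / 4) * 4 AS period,   -- 1900, 1904, 1908, ...
        sum(occurrences) AS total
    FROM
        normalized
    GROUP BY
        floor(year / 4) * 4
) b
ON
    floor(a.year / 4) * 4 = b.period
WHERE
    a.gram IN ('iowa', 'new york')       -- placeholder list; casing depends
                                         -- on how the grams were normalized
GROUP BY
    a.gram,
    b.period,
    b.total;
```

No idea whether that's the right approach, so corrections welcome.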
I would then like to take that raw data and export it out of Hadoop for
further manipulation.
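For the export step, my understanding is that Hive can write query results straight to a local directory, something like the following (the path is just a placeholder, and the output is plain text with Hive's default field delimiter):

```sql
-- Sketch: dump the aggregated table to a local path for further analysis.
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/by_decade_export'
SELECT * FROM by_decade;
```

If there's a better way to pull data out of Hadoop, I'm all ears.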
Can anyone help me here? I greatly appreciate it.
--
Brett Neese
563-210-3459
http://brneese.com