[Cialug] a bit OT: working with big data

Tim Champion timchampion at gmail.com
Thu Nov 15 10:53:44 CST 2012


I know nothing about this MapReduce framework, but I can help with SQL.

For your "every 4 years" requirement, you could use the modulo operator MOD
like so:
(instead of substr(year, 0, 3) as decade, make it: year MOD 4 as every_four
then group by every_four instead of decade.

Is that what you are looking for? or do you want 4 year blocks grouped
together? 2001-2004, 2005-2008, etc.

I'm not clear on what you want to do for your state table. What do you want
that table structure to look like specifically? In general, that would
likely group by state, but not sure what other fields you're looking for.

Tim Champion
timchampion at gmail.com


On Wed, Nov 14, 2012 at 8:00 PM, Brett Neese <brneese at brneese.com> wrote:

> Hey all,
>
> Does anyone in this group have experience working with Hive/Hadoop
> MapReduce frameworks?
>
> I'm currently working with this how-to
> <http://aws.amazon.com/articles/5249664154115844>and associated data-set
> and am having some troubles trying to figure out how to structure my
> queries.
>
> My needs diverge around step "Word Ratio By Decade" -- namely, a) I would
> not like to limit to decade marks when I create my new table, instead, I
> would prefer to limit to every four years and b) Instead of using the
> entire corpus, I would like to create a table that is limited to the names
> of the states (for instance "Iowa" and "New York").  The current query is
>
>
> INSERT OVERWRITE TABLE by_decade
> SELECT
>  a.gram,
>  b.decade,
>  sum(a.occurrences) / b.total
> FROM
>  normalized a
> JOIN (
>  SELECT
>   substr(year, 0, 3) as decade,
>   sum(occurrences) as total
>  FROM
>   normalized
>  GROUP BY
>   substr(year, 0, 3)
> ) b
> ON
>  substr(a.year, 0, 3) = b.decade
> GROUP BY
>  a.gram,
>  b.decade,
>  b.total;
>
> And I have very limited SQL skills and have no idea how to go about
> adapting that. I do know I will need to manually hand code in a WHERE
> clause to limit to the names of the states.
>
> I would then like to take that raw data and export it out of Hadoop for
> further manipulation.
>
> Can anyone help me here? I greatly appreciate it.
>
> --
> Brett Neese
> 563-210-3459
> http://brneese.com
> _______________________________________________
> Cialug mailing list
> Cialug at cialug.org
> http://cialug.org/mailman/listinfo/cialug
>


More information about the Cialug mailing list