Apache Pig tips (swill?)

Datetime: 2017-01-04 05:29:00         Topic: Apache Pig

Apache Pig is designed to handle analysis of large data sets using a high-level language (Pig Latin) that allows for parallelisation. Pig Latin compiles to sequences of Map-Reduce programs that can be executed on Hadoop.
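
As a rough illustration of what such a data flow looks like, the shell sketch below writes a small word-count script in Pig Latin and wraps the command for running it in local mode. The file names (wordcount.pig, input.txt, wordcount_out) and the two helper functions are made up for this example, and it assumes a pig launcher is on your PATH.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only; names and paths are assumptions.

# write_wordcount_script FILE: create a small Pig Latin word-count script.
write_wordcount_script() {
    cat > "$1" <<'EOF'
-- Load each input line as a single chararray field
lines   = LOAD 'input.txt' AS (line:chararray);
-- Split lines into words, one word per record
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- Group identical words together and count each group
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS n;
-- Writing the result is what triggers execution
STORE counts INTO 'wordcount_out';
EOF
}

# run_local FILE: run a Pig Latin script on a single machine (no cluster).
run_local() {
    pig -x local "$1"
}
```

Something like `write_wordcount_script wordcount.pig && run_local wordcount.pig` would then run the flow on a single machine. Each statement only describes a step; Pig works out the jobs to run when it reaches the STORE.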

This post pulls together an archive of some Apache Pig tips tweeted as “Apache #Pig tip of the day”.

  • 18th June 2013 :
    Use the org.apache.pig.builtin.MonitoredUDF annotation to terminate your regular or Algebraic UDFs if they run for too long.
  • 17th June 2013 :
    The HadoopJobHistoryLoader in the piggybank can be used to check for failed jobs amongst other things.
  • 14th June 2013 :
    Visualise the execution plan as a directed acyclic graph using the -dot argument for EXPLAIN, then pass the output through Graphviz dot.
  • 13th June 2013 :
    Just like your favourite RDBMS, Pig has an EXPLAIN command – get the logical, physical and MapReduce execution plans.
  • 12th June 2013 :
    Know your path – the pig shell script tries to locate hadoop using ‘which’.
  • 11th June 2013 :
    Penny (a debug/tracing tool) users should be aware that it has been removed from trunk in 0.11 #tidyout
  • 10th June 2013 :
    Pig can work with data serialized using Avro – the necessary AvroStorage & related classes are in the PiggyBank
  • 7th June 2013 :
    Use PigUnit for unit testing Pig scripts. It defaults to local mode; set the pigunit.exectype.cluster property to run on MapReduce.
  • 6th June 2013 :
    Amazon Elastic MapReduce supports Pig (0.9.x). You can run newer unsupported versions: https://forums.aws.amazon.com/thread.jspa?messageID=455015
  • 5th June 2013 :
    The PiggyBank contains contributed Java UDFs. Very useful stuff in contrib/piggybank (caveat: they are ‘as-is’).
  • 4th June 2013 :
    Pig supports user defined functions (UDFs). Write them in Java or (with less support) Python, JS, Ruby & #Groovy
  • 3rd June 2013 :
    Use the ILLUSTRATE command to exemplify a Pig Latin script with concise, complete and realistic data. #iterate
  • 2nd June 2013 :
    Pig provides a high level language (Pig Latin) for data analysis that compiles to Hadoop Map-Reduce data flows.
  • 31st May 2013 :
    ‘pig -x local’ runs on a single machine; the default, ‘-x mapreduce’ mode, runs on a Hadoop cluster. #bigdata
  • 30th May 2013 :
    Pig supports multiple Hadoop versions (0.20 by default) – make sure you set the hadoopversion build parameter
  • 29th May 2013 :
    The -secretDebugCmd parameter shows the environment pig/hadoop will use (useful for Error 2998). #bigdata
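
The 12th June tip about the pig script locating hadoop on your path can be checked by hand before launching anything. The sketch below is not the actual bin/pig code; find_hadoop is a hypothetical helper, and it uses the POSIX `command -v` builtin in place of `which` to do the same kind of PATH lookup.

```shell
#!/bin/sh
# Hypothetical helper, not taken from the pig launcher itself.
# find_hadoop: print the hadoop executable found on PATH, or fail with a hint.
find_hadoop() {
    # command -v prints the resolved path and returns non-zero if nothing matches.
    hadoop_bin=$(command -v hadoop) || {
        echo "hadoop not found on PATH: $PATH" >&2
        return 1
    }
    echo "$hadoop_bin"
}
```

Running `find_hadoop` in the same shell you start pig from shows which hadoop installation would be picked up, which is handy when several versions are installed side by side.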






