Apache Pig is a high-level language for creating MapReduce applications. This
recipe will use Apache Pig and a Pig user-defined filter function (UDF) to
remove all bot traffic from a sample web server log dataset. Bot traffic is the non-human
traffic that visits a webpage, such as spiders.
Getting ready
You will need to download/compile/install the following:
- Version 0.8.1 or newer of Apache Pig from http://pig.apache.org/
- Test data: apache_tsv.txt and useragent_blacklist.txt from the support page on the Packt website, http://www.packtpub.com/support
- Place apache_tsv.txt in HDFS and put useragent_blacklist.txt in your current working directory
How to do it...
Carry out the following steps to filter bot traffic using an Apache Pig UDF:
1. First, write a Pig UDF that extends the Pig FilterFunc abstract class. This class will be used to filter records in the weblogs dataset by using the user agent string.
package com.packt.ch3.etl.pig;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.pig.FilterFunc;
import org.apache.pig.data.Tuple;

public class IsUseragentBot extends FilterFunc {

    private Set<String> blacklist = null;

    // Loads the blacklist through the "blacklist" symlink created by the
    // distributed cache (see the set mapred.cache.files statement in step 2)
    private void loadBlacklist() throws IOException {
        blacklist = new HashSet<String>();
        BufferedReader in = new BufferedReader(new FileReader("blacklist"));
        String userAgent = null;
        while ((userAgent = in.readLine()) != null) {
            blacklist.add(userAgent);
        }
        in.close();
    }

    @Override
    public Boolean exec(Tuple tuple) throws IOException {
        if (blacklist == null) {
            loadBlacklist();
        }
        // Guard against empty input; null never satisfies the FILTER condition
        if (tuple == null || tuple.size() == 0) {
            return null;
        }
        String ua = (String) tuple.get(0);
        return blacklist.contains(ua);
    }
}
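Note that exec performs an exact, case-sensitive lookup against the blacklist. The check can be sketched outside the Pig runtime as follows (the class and method names here are illustrative, not part of the recipe's code):

```java
import java.util.HashSet;
import java.util.Set;

public class BlacklistCheck {
    // Mirrors the lookup inside IsUseragentBot.exec: true only on an exact match
    public static boolean isBot(String userAgent, Set<String> blacklist) {
        return userAgent != null && blacklist.contains(userAgent);
    }

    public static void main(String[] args) {
        Set<String> blacklist = new HashSet<String>();
        blacklist.add("Mozilla/5.0 (compatible; Googlebot/2.1)");
        // Only a byte-for-byte match is flagged; substrings and case variants pass
        System.out.println(isBot("Mozilla/5.0 (compatible; Googlebot/2.1)", blacklist)); // true
        System.out.println(isBot("googlebot/2.1", blacklist)); // false
    }
}
```

This means each line of useragent_blacklist.txt must contain a complete user agent string; matching on substrings (for example, any agent containing "bot") would require iterating over the entries instead.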
2. Next, create a Pig script in your current working directory. At the beginning of the Pig script, give the MapReduce framework the path to the blacklist file in HDFS:
set mapred.cache.files '/user/hadoop/blacklist.txt#blacklist';
set mapred.create.symlink 'yes';
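The '#blacklist' fragment tells Hadoop to expose the cached HDFS file through a local symlink named blacklist in each task's working directory, which is why the UDF can open it with new FileReader("blacklist"). The mechanism can be illustrated with plain Java symlinks (a minimal sketch assuming a POSIX filesystem; the class name and temporary paths are illustrative):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class SymlinkSketch {
    // Writes a stand-in for the cached file, links a short name to it, and
    // reads it back through the link, as a task would read "blacklist"
    public static String readThroughLink() throws IOException {
        Path dir = Files.createTempDirectory("cache");
        Path cached = dir.resolve("useragent_blacklist.txt"); // stands in for the HDFS copy
        Files.write(cached, List.of("BadBot/1.0"));
        Path link = dir.resolve("blacklist"); // the '#blacklist' fragment name
        Files.createSymbolicLink(link, cached);
        return Files.readAllLines(link).get(0);
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readThroughLink()); // prints the first blacklist entry
    }
}
```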
3. Register the JAR file containing the IsUseragentBot class with Pig, and write the Pig script to filter the weblogs by the user agent:
register myudfjar.jar;

all_weblogs = LOAD '/user/hadoop/apache_tsv.txt' AS (ip:chararray, timestamp:long, page:chararray, http_status:int, payload_size:int, useragent:chararray);

nobots_weblogs = FILTER all_weblogs BY NOT com.packt.ch3.etl.pig.IsUseragentBot(useragent);

STORE nobots_weblogs INTO '/user/hadoop/nobots_weblogs';
To run the Pig job, put myudfjar.jar into the same folder as the Pig script and execute it:
$ ls
filter_bot_traffic.pig  myudfjar.jar
$ pig -f filter_bot_traffic.pig
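Conceptually, the script keeps every row of all_weblogs whose useragent field does not appear in the blacklist. The same per-row decision can be sketched over TSV input in plain Java, with no Hadoop dependencies (class and method names are illustrative):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class NoBotsFilter {
    // Keeps rows whose useragent (the sixth column in the LOAD schema)
    // is not in the blacklist, mirroring FILTER ... BY NOT IsUseragentBot(useragent)
    public static List<String> filter(List<String> rows, Set<String> blacklist) {
        List<String> kept = new ArrayList<String>();
        for (String row : rows) {
            String[] fields = row.split("\t");
            if (fields.length < 6 || !blacklist.contains(fields[5])) {
                kept.add(row);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> blacklist = new HashSet<String>(Arrays.asList("BadBot/1.0"));
        List<String> rows = Arrays.asList(
            "10.0.0.1\t1336684625\t/index.html\t200\t512\tMozilla/5.0",
            "10.0.0.2\t1336684630\t/index.html\t200\t512\tBadBot/1.0");
        // Only the non-bot row survives
        System.out.println(filter(rows, blacklist).size()); // 1
    }
}
```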
Perform the following steps to sort data using Apache Pig:
1. First, load the web server log data into a Pig relation:
nobots_weblogs = LOAD '/user/hadoop/apache_nobots_tsv.txt' AS (ip:chararray, timestamp:long, page:chararray, http_status:int, payload_size:int, useragent:chararray);
2. Next, order the web server log records by the timestamp field in ascending order:
ordered_weblogs = ORDER nobots_weblogs BY timestamp;
3. Finally, store the sorted results in HDFS:
STORE ordered_weblogs INTO '/user/hadoop/ordered_weblogs';
4. Run the Pig job:
$ pig -f ordered_weblogs.pig
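ORDER BY in Pig performs a total sort of the relation across all reducers. The resulting ascending-timestamp ordering can be sketched in plain Java over TSV rows (for illustration only; the class name is not part of the recipe):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class OrderByTimestamp {
    // Sorts rows ascending by the timestamp column (second field in the schema),
    // mirroring ORDER nobots_weblogs BY timestamp
    public static List<String> order(List<String> rows) {
        List<String> sorted = new ArrayList<String>(rows);
        sorted.sort(Comparator.comparingLong(
            (String r) -> Long.parseLong(r.split("\t")[1])));
        return sorted;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList(
            "10.0.0.2\t1336684630\t/a.html\t200\t512\tMozilla/5.0",
            "10.0.0.1\t1336684625\t/b.html\t200\t256\tMozilla/5.0");
        // The earlier timestamp comes first
        System.out.println(order(rows).get(0).startsWith("10.0.0.1")); // true
    }
}
```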