Saturday, 28 December 2013

Using Apache Pig to filter bot traffic from web server logs

Apache Pig is a high-level language for creating MapReduce applications. This recipe uses Apache Pig and a Pig user-defined filter function (UDF) to remove all bot traffic from a sample web server log dataset. Bot traffic is non-human traffic that visits a web page, such as search-engine crawlers and spiders.
Getting ready
You will need to download/compile/install the following:
- Version 0.8.1 or better of Apache Pig from http://pig.apache.org/
- Test data: apache_tsv.txt and useragent_blacklist.txt from the support page on the Packt website, http://www.packtpub.com/support
- Place both apache_tsv.txt and useragent_blacklist.txt in HDFS; the blacklist file is distributed to the task nodes via the distributed cache
How to do it...
Carry out the following steps to filter bot traffic using an Apache Pig UDF:
1. First, write a Pig UDF that extends the Pig FilterFunc abstract class. This class will be used to filter records in the weblogs dataset by using the user agent string.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.pig.FilterFunc;
import org.apache.pig.data.Tuple;

public class IsUseragentBot extends FilterFunc {

    private Set<String> blacklist = null;

    // Lazily load the blacklist from the "blacklist" symlink that the
    // distributed cache creates in the task's working directory
    private void loadBlacklist() throws IOException {
        blacklist = new HashSet<String>();
        BufferedReader in = new BufferedReader(new FileReader("blacklist"));
        String userAgent = null;
        while ((userAgent = in.readLine()) != null) {
            blacklist.add(userAgent);
        }
        in.close();
    }

    @Override
    public Boolean exec(Tuple tuple) throws IOException {
        if (blacklist == null) {
            loadBlacklist();
        }
        if (tuple == null || tuple.size() == 0) {
            return null;
        }
        // Return true when the record's user agent appears in the blacklist
        String ua = (String) tuple.get(0);
        return blacklist.contains(ua);
    }
}
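Stripped of the Pig plumbing, the core of this UDF is plain set membership against the blacklist. A minimal, self-contained Java sketch of that check (no Pig dependencies; the user-agent strings here are invented for illustration):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class BlacklistCheck {
    public static void main(String[] args) {
        // Hypothetical blacklist entries; in the recipe these come
        // from useragent_blacklist.txt, one user agent per line
        Set<String> blacklist = new HashSet<String>(
                Arrays.asList("Googlebot/2.1", "Baiduspider"));

        // The same membership test the UDF's exec() performs per tuple
        System.out.println(blacklist.contains("Googlebot/2.1")); // true
        System.out.println(blacklist.contains("Mozilla/5.0"));   // false
    }
}
```

A HashSet gives O(1) average-case lookups, which matters when the check runs once per log record across the whole dataset.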

2. Next, create a Pig script in your current working directory. At the beginning of the Pig script, give the MapReduce framework the path to useragent_blacklist.txt in HDFS:
set mapred.cache.files '/user/hadoop/useragent_blacklist.txt#blacklist';
set mapred.create.symlink 'yes';
3. Register the JAR file containing the IsUseragentBot class with Pig, and write the Pig script to filter the weblogs by the user agent:
register myudfjar.jar;
all_weblogs = LOAD '/user/hadoop/apache_tsv.txt' AS (ip: chararray, timestamp:long, page:chararray, http_status:int, payload_size:int, useragent:chararray);
nobots_weblogs = FILTER all_weblogs BY NOT com.packt.ch3.etl.pig.IsUseragentBot(useragent);
STORE nobots_weblogs INTO '/user/hadoop/nobots_weblogs';
To run the Pig job, put myudfjar.jar into the same folder as the Pig script and execute it:
$ ls
filter_bot_traffic.pig  myudfjar.jar
$ pig -f filter_bot_traffic.pig

Perform the following steps to sort data using Apache Pig:
1. First load the web server log data into a Pig relation:
nobots_weblogs = LOAD '/user/hadoop/apache_nobots_tsv.txt' AS (ip: chararray, timestamp:long, page:chararray, http_status:int, payload_size:int, useragent:chararray);
2. Next, order the web server log records by the timestamp field in ascending order:
ordered_weblogs = ORDER nobots_weblogs BY timestamp;
3. Finally, store the sorted results in HDFS:
STORE ordered_weblogs INTO '/user/hadoop/ordered_weblogs';
4. Run the Pig job:
$ pig -f ordered_weblogs.pig
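Conceptually, the ORDER statement above sorts every record by its timestamp field. A small Java sketch of the same idea on in-memory records (the field names follow the LOAD schema; the sample values are invented):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class OrderByTimestamp {
    // Minimal record mirroring part of the LOAD schema
    static class LogRecord {
        final String ip;
        final long timestamp;
        LogRecord(String ip, long timestamp) {
            this.ip = ip;
            this.timestamp = timestamp;
        }
    }

    public static void main(String[] args) {
        List<LogRecord> logs = new ArrayList<LogRecord>();
        // Invented sample rows, deliberately out of order
        logs.add(new LogRecord("10.0.0.2", 1336684625L));
        logs.add(new LogRecord("10.0.0.1", 1336684100L));

        // Ascending sort by timestamp, like ORDER ... BY timestamp
        logs.sort(Comparator.comparingLong(r -> r.timestamp));

        System.out.println(logs.get(0).ip); // prints 10.0.0.1
    }
}
```

The difference, of course, is that Pig performs this sort as a distributed MapReduce job across the cluster rather than in a single JVM.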

