Tuesday, 29 April 2025

BIG DATA CD 362

Program 1:

Implement the following data structures in Java:
a) Lists
b) Stacks
c) Queues

sol:

a)  List

(i) ArrayList

import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;

public class ArrayListExample {
    public static void main(String[] args) {
        // Creating an ArrayList
        ArrayList<String> list = new ArrayList<>();

        // Adding elements
        list.add("Apple");
        list.add("Banana");
        list.add("Cherry");
        list.add("Mango");
        System.out.println("Initial List: " + list);

        // Accessing an element
        System.out.println("Element at index 2: " + list.get(2));

        // Updating an element
        list.set(1, "Blueberry");
        System.out.println("After updating index 1: " + list);

        // Removing an element
        list.remove("Mango");
        System.out.println("After removing 'Mango': " + list);

        // Checking if an element exists
        System.out.println("Contains 'Apple'? " + list.contains("Apple"));

        // Sorting the list
        Collections.sort(list);
        System.out.println("Sorted List: " + list);

        // Iterating using for-each loop
        System.out.println("Iterating using for-each loop:");
        for (String item : list) {
            System.out.println(item);
        }

        // Iterating using Iterator
        System.out.println("Iterating using Iterator:");
        Iterator<String> iterator = list.iterator();
        while (iterator.hasNext()) {
            System.out.println(iterator.next());
        }

        // Getting size of the list
        System.out.println("Size of list: " + list.size());

        // Clearing the list
        list.clear();
        System.out.println("After clearing: " + list);
    }
}

Output:

Initial List: [Apple, Banana, Cherry, Mango]
Element at index 2: Cherry
After updating index 1: [Apple, Blueberry, Cherry, Mango]
After removing 'Mango': [Apple, Blueberry, Cherry]
Contains 'Apple'? true
Sorted List: [Apple, Blueberry, Cherry]
Iterating using for-each loop:
Apple
Blueberry
Cherry
Iterating using Iterator:
Apple
Blueberry
Cherry
Size of list: 3
After clearing: []

 

(ii) LINKED LIST

import java.util.LinkedList;

public class LinkedListExample {
    public static void main(String[] args) {
        System.out.println("\nLinkedList Example:");
        LinkedList<String> list = new LinkedList<>();

        // Adding elements
        list.add("A");
        list.add("B");
        list.add("C");
        list.add("D");
        System.out.println("Initial LinkedList: " + list);

        // Adding elements at first and last positions
        list.addFirst("Start");
        list.addLast("End");
        System.out.println("After adding at first and last: " + list);

        // Accessing elements
        System.out.println("First Element: " + list.getFirst());
        System.out.println("Last Element: " + list.getLast());

        // Removing elements
        System.out.println("Removed First: " + list.removeFirst());
        System.out.println("Removed Last: " + list.removeLast());
        System.out.println("LinkedList after removals: " + list);

        // Checking if an element exists
        System.out.println("Contains 'B'? " + list.contains("B"));

        // Getting size
        System.out.println("Size of LinkedList: " + list.size());

        // Iterating through the LinkedList
        System.out.println("Iterating through LinkedList:");
        for (String item : list) {
            System.out.println(item);
        }

        // Clearing the LinkedList
        list.clear();
        System.out.println("LinkedList after clearing: " + list);
    }
}

Output:

LinkedList Example:
Initial LinkedList: [A, B, C, D]
After adding at first and last: [Start, A, B, C, D, End]
First Element: Start
Last Element: End
Removed First: Start
Removed Last: End
LinkedList after removals: [A, B, C, D]
Contains 'B'? true
Size of LinkedList: 4
Iterating through LinkedList:
A
B
C
D
LinkedList after clearing: []

 

 

(iii) VECTOR

import java.util.Vector;

public class VectorExample {
    public static void main(String[] args) {
        System.out.println("\nVector Example:");
        Vector<Integer> vector = new Vector<>();

        // Adding elements
        vector.add(10);
        vector.add(20);
        vector.add(30);
        vector.add(40);
        System.out.println("Initial Vector: " + vector);

        // Adding at a specific index
        vector.add(1, 15);
        System.out.println("After adding 15 at index 1: " + vector);

        // Replacing an element
        vector.set(2, 25);
        System.out.println("After updating index 2: " + vector);

        // Removing elements
        System.out.println("Removed Element: " + vector.remove(0));
        System.out.println("Vector after removals: " + vector);

        // Checking if an element exists
        System.out.println("Contains 20? " + vector.contains(20));

        // Getting an element
        System.out.println("Element at index 1: " + vector.get(1));

        // Getting size and capacity
        System.out.println("Size: " + vector.size());
        System.out.println("Capacity: " + vector.capacity());

        // Iterating through the Vector
        System.out.println("Iterating through Vector:");
        for (Integer num : vector) {
            System.out.println(num);
        }

        // Clearing the Vector
        vector.clear();
        System.out.println("Vector after clearing: " + vector);
    }
}

Output:


Vector Example:

Initial Vector: [10, 20, 30, 40]

After adding 15 at index 1: [10, 15, 20, 30, 40]

After updating index 2: [10, 15, 25, 30, 40]

Removed Element: 10

Vector after removals: [15, 25, 30, 40]

Contains 20? false

Element at index 1: 25

Size: 4

Capacity: 10

Iterating through Vector:

15

25

30

40

Vector after clearing: []

b)  STACK

import java.util.Stack;

public class StackExample {
    public static void main(String[] args) {
        // Creating a Stack
        Stack<String> stack = new Stack<>();

        // PUSH operation (adding elements)
        stack.push("Apple");
        stack.push("Banana");
        stack.push("Cherry");
        System.out.println("Stack after push: " + stack);

        // PEEK operation (view top element)
        System.out.println("Top element (peek): " + stack.peek());

        // POP operation (removing top element)
        System.out.println("Popped element: " + stack.pop());
        System.out.println("Stack after pop: " + stack);

        // SEARCH operation (1-based position counted from the top of the stack)
        int position = stack.search("Apple");
        System.out.println("Position of 'Apple': " + position);

        // CHECK if Stack is empty
        System.out.println("Is stack empty? " + stack.isEmpty());
    }
}

Output:

Stack after push: [Apple, Banana, Cherry]
Top element (peek): Cherry
Popped element: Cherry
Stack after pop: [Apple, Banana]
Position of 'Apple': 2
Is stack empty? false
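Note: java.util.Stack is a legacy synchronized class, and its own documentation recommends using a Deque implementation for LIFO stack behaviour instead. A minimal sketch of the same push/peek/pop flow with ArrayDeque (the class name here is chosen just for illustration):

import java.util.ArrayDeque;
import java.util.Deque;

public class DequeAsStackExample {
    public static void main(String[] args) {
        // A Deque used as a LIFO stack: push, peek and pop all work on the head
        Deque<String> stack = new ArrayDeque<>();
        stack.push("Apple");
        stack.push("Banana");
        stack.push("Cherry");
        System.out.println("Top element (peek): " + stack.peek()); // Cherry
        System.out.println("Popped element: " + stack.pop());      // Cherry
        System.out.println("Is stack empty? " + stack.isEmpty());  // false
    }
}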

 

c)  QUEUE

(i) PRIORITY QUEUE

import java.util.PriorityQueue;

public class PriorityQueueExample {
    public static void main(String[] args) {
        System.out.println("\nPriorityQueue Example:");
        PriorityQueue<Integer> pq = new PriorityQueue<>();

        // Adding elements
        pq.add(30);
        pq.add(10);
        pq.add(20);
        pq.add(40);
        System.out.println("Initial PriorityQueue: " + pq);

        // Accessing the head element
        System.out.println("Peek (Head Element): " + pq.peek());

        // Removing elements
        System.out.println("Poll (Removing Head): " + pq.poll());
        System.out.println("PriorityQueue after poll: " + pq);

        // Checking if an element exists
        System.out.println("Contains 20? " + pq.contains(20));

        // Getting size
        System.out.println("Size of PriorityQueue: " + pq.size());

        // Iterating through the PriorityQueue
        System.out.println("Iterating through PriorityQueue:");
        for (Integer num : pq) {
            System.out.println(num);
        }

        // Clearing the PriorityQueue
        pq.clear();
        System.out.println("PriorityQueue after clearing: " + pq);
    }
}

Output:

PriorityQueue Example:
Initial PriorityQueue: [10, 30, 20, 40]
Peek (Head Element): 10
Poll (Removing Head): 10
PriorityQueue after poll: [20, 30, 40]
Contains 20? true
Size of PriorityQueue: 3
Iterating through PriorityQueue:
20
30
40
PriorityQueue after clearing: []
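Note that the printed order [10, 30, 20, 40] is the internal binary-heap layout, not sorted order: only the head returned by peek()/poll() is guaranteed to be the smallest element, and iteration order is unspecified. To consume elements in priority order, poll until the queue is empty. A minimal sketch (class name chosen for illustration):

import java.util.PriorityQueue;

public class PriorityQueueDrainExample {
    public static void main(String[] args) {
        PriorityQueue<Integer> pq = new PriorityQueue<>();
        pq.add(30);
        pq.add(10);
        pq.add(20);
        pq.add(40);
        // poll() always removes the current minimum, so this prints 10 20 30 40
        while (!pq.isEmpty()) {
            System.out.println(pq.poll());
        }
    }
}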

 

 

(ii) DEQUE

import java.util.ArrayDeque;
import java.util.Deque;

public class DequeExample {
    public static void main(String[] args) {
        System.out.println("\nDeque Example:");
        Deque<String> deque = new ArrayDeque<>();

        // Adding elements
        deque.add("A");
        deque.addFirst("Start");
        deque.addLast("End");
        deque.add("B");
        System.out.println("Deque: " + deque);

        // Accessing first and last elements
        System.out.println("First Element: " + deque.getFirst());
        System.out.println("Last Element: " + deque.getLast());

        // Removing first and last elements
        System.out.println("Removed First: " + deque.removeFirst());
        System.out.println("Removed Last: " + deque.removeLast());
        System.out.println("Deque after removals: " + deque);

        // Checking membership and size
        System.out.println("Contains 'A'? " + deque.contains("A"));
        System.out.println("Size: " + deque.size());

        // Iterating through the Deque
        for (String item : deque) {
            System.out.println(item);
        }

        // Clearing the Deque
        deque.clear();
        System.out.println("Deque after clearing: " + deque);
    }
}

Output:

Deque Example:

Deque: [Start, A, End, B]
First Element: Start
Last Element: B
Removed First: Start
Removed Last: B
Deque after removals: [A, End]
Contains 'A'? true
Size: 2
A
End
Deque after clearing: []

 

 

(iii) ArrayDeque

import java.util.ArrayDeque;

public class ArrayDequeExample {
    public static void main(String[] args) {
        // Creating an ArrayDeque
        ArrayDeque<String> deque = new ArrayDeque<>();

        // Adding elements at the end
        deque.add("Apple");
        deque.add("Banana");
        deque.add("Cherry");

        // Adding elements at the front and at the end
        deque.addFirst("Mango");
        deque.addLast("Orange");

        // Printing the deque
        System.out.println("Deque after additions: " + deque);

        // Removing elements
        deque.removeFirst(); // Removes "Mango"
        deque.removeLast();  // Removes "Orange"

        // Printing the deque after removals
        System.out.println("Deque after removals: " + deque);

        // Accessing elements
        System.out.println("First element: " + deque.getFirst());
        System.out.println("Last element: " + deque.getLast());
    }
}

Output:

Deque after additions: [Mango, Apple, Banana, Cherry, Orange]
Deque after removals: [Apple, Banana, Cherry]
First element: Apple
Last element: Cherry


Program 2:

Implement the following data structures in Java:
a) Map
b) Set

sol:

a) Map

 

(i) HashMap

import java.util.HashMap;

public class HashMapExample {
    public static void main(String[] args) {
        System.out.println("\nHashMap Example:");
        HashMap<Integer, String> map = new HashMap<>();

        // Adding key-value pairs
        map.put(1, "Apple");
        map.put(2, "Banana");
        map.put(3, "Cherry");
        map.put(4, "Date");
        System.out.println("Initial HashMap: " + map);

        // Accessing, removing and checking entries
        System.out.println("Get key 2: " + map.get(2));
        map.remove(3);
        System.out.println("After removing key 3: " + map);
        System.out.println("Contains key 1? " + map.containsKey(1));
        System.out.println("Contains value 'Banana'? " + map.containsValue("Banana"));
        System.out.println("Size: " + map.size());

        // Iterating through the HashMap
        System.out.println("Iterating through HashMap:");
        for (var entry : map.entrySet()) {
            System.out.println("Key: " + entry.getKey() + ", Value: " + entry.getValue());
        }

        // Clearing the HashMap
        map.clear();
        System.out.println("HashMap after clearing: " + map);
    }
}

Output:

HashMap Example:

Initial HashMap: {1=Apple, 2=Banana, 3=Cherry, 4=Date}
Get key 2: Banana
After removing key 3: {1=Apple, 2=Banana, 4=Date}
Contains key 1? true
Contains value 'Banana'? true
Size: 3
Iterating through HashMap:
Key: 1, Value: Apple
Key: 2, Value: Banana
Key: 4, Value: Date
HashMap after clearing: {}

 

 

 

 

(ii) LinkedHashMap

import java.util.LinkedHashMap;

public class LinkedHashMapExample {
    public static void main(String[] args) {
        System.out.println("\nLinkedHashMap Example:");
        LinkedHashMap<String, Integer> map = new LinkedHashMap<>();

        // Adding key-value pairs (insertion order is preserved)
        map.put("One", 1);
        map.put("Two", 2);
        map.put("Three", 3);
        map.put("Four", 4);
        System.out.println("Initial LinkedHashMap: " + map);

        // Accessing, removing and checking entries
        System.out.println("Get value for 'Two': " + map.get("Two"));
        map.remove("Three");
        System.out.println("After removing 'Three': " + map);
        System.out.println("Contains key 'One'? " + map.containsKey("One"));
        System.out.println("Contains value 4? " + map.containsValue(4));
        System.out.println("Size: " + map.size());

        // Iterating through the LinkedHashMap
        System.out.println("Iterating through LinkedHashMap:");
        for (var entry : map.entrySet()) {
            System.out.println("Key: " + entry.getKey() + ", Value: " + entry.getValue());
        }

        // Clearing the LinkedHashMap
        map.clear();
        System.out.println("LinkedHashMap after clearing: " + map);
    }
}

Output:

LinkedHashMap Example:

Initial LinkedHashMap: {One=1, Two=2, Three=3, Four=4}
Get value for 'Two': 2
After removing 'Three': {One=1, Two=2, Four=4}
Contains key 'One'? true
Contains value 4? true
Size: 3
Iterating through LinkedHashMap:
Key: One, Value: 1
Key: Two, Value: 2
Key: Four, Value: 4
LinkedHashMap after clearing: {}

 

 

(iii) TreeMap

import java.util.TreeMap;

public class TreeMapExample {
    public static void main(String[] args) {
        System.out.println("\nTreeMap Example:");
        TreeMap<Integer, String> map = new TreeMap<>();

        // Adding key-value pairs (keys are kept in sorted order)
        map.put(5, "Eagle");
        map.put(1, "Apple");
        map.put(3, "Cherry");
        map.put(2, "Banana");
        System.out.println("Initial TreeMap: " + map);

        // Accessing, removing and checking entries
        System.out.println("Get value for key 2: " + map.get(2));
        map.remove(3);
        System.out.println("After removing key 3: " + map);
        System.out.println("Contains key 1? " + map.containsKey(1));
        System.out.println("Contains value 'Eagle'? " + map.containsValue("Eagle"));
        System.out.println("Size: " + map.size());

        // Iterating through the TreeMap
        System.out.println("Iterating through TreeMap:");
        for (var entry : map.entrySet()) {
            System.out.println("Key: " + entry.getKey() + ", Value: " + entry.getValue());
        }

        // First and last keys
        System.out.println("First Key: " + map.firstKey());
        System.out.println("Last Key: " + map.lastKey());

        // Clearing the TreeMap
        map.clear();
        System.out.println("TreeMap after clearing: " + map);
    }
}

Output:

TreeMap Example:

Initial TreeMap: {1=Apple, 2=Banana, 3=Cherry, 5=Eagle}
Get value for key 2: Banana
After removing key 3: {1=Apple, 2=Banana, 5=Eagle}
Contains key 1? true
Contains value 'Eagle'? true
Size: 3
Iterating through TreeMap:
Key: 1, Value: Apple
Key: 2, Value: Banana
Key: 5, Value: Eagle
First Key: 1
Last Key: 5
TreeMap after clearing: {}
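The three Map implementations above differ mainly in iteration order: HashMap gives no ordering guarantee, LinkedHashMap preserves insertion order, and TreeMap keeps keys sorted. A minimal side-by-side sketch (class name chosen for illustration):

import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

public class MapOrderingExample {
    public static void main(String[] args) {
        int[] keys = {5, 1, 3, 2};
        Map<Integer, String> hash = new HashMap<>();
        Map<Integer, String> linked = new LinkedHashMap<>();
        Map<Integer, String> tree = new TreeMap<>();
        for (int k : keys) {
            hash.put(k, "v" + k);
            linked.put(k, "v" + k);
            tree.put(k, "v" + k);
        }
        System.out.println("HashMap (no guaranteed order): " + hash);
        System.out.println("LinkedHashMap (insertion order): " + linked); // {5=v5, 1=v1, 3=v3, 2=v2}
        System.out.println("TreeMap (sorted by key): " + tree);           // {1=v1, 2=v2, 3=v3, 5=v5}
    }
}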

 

b) Set

SET

(i) HashSet

import java.util.HashSet;

public class HashSetExample {
    public static void main(String[] args) {
        System.out.println("\nHashSet Example:");
        HashSet<String> set = new HashSet<>();

        // Adding elements (duplicates are ignored)
        set.add("Apple");
        set.add("Banana");
        set.add("Cherry");
        set.add("Date");
        System.out.println("Initial HashSet: " + set);

        // Removing and checking elements
        set.remove("Cherry");
        System.out.println("After removing 'Cherry': " + set);
        System.out.println("Contains 'Apple'? " + set.contains("Apple"));
        System.out.println("Size: " + set.size());

        // Iterating through the HashSet
        System.out.println("Iterating through HashSet:");
        for (String item : set) {
            System.out.println(item);
        }

        // Clearing the HashSet
        set.clear();
        System.out.println("HashSet after clearing: " + set);
    }
}

Output:

HashSet Example:

Initial HashSet: [Apple, Cherry, Date, Banana]
After removing 'Cherry': [Apple, Date, Banana]
Contains 'Apple'? true
Size: 3
Iterating through HashSet:
Apple
Date
Banana
HashSet after clearing: []

 

 

(ii) LinkedHashSet

import java.util.LinkedHashSet;

public class LinkedHashSetExample {
    public static void main(String[] args) {
        System.out.println("\nLinkedHashSet Example:");
        LinkedHashSet<Integer> set = new LinkedHashSet<>();

        // Adding elements (insertion order is preserved)
        set.add(10);
        set.add(20);
        set.add(30);
        set.add(40);
        System.out.println("Initial LinkedHashSet: " + set);

        // Removing and checking elements
        set.remove(30);
        System.out.println("After removing 30: " + set);
        System.out.println("Contains 20? " + set.contains(20));
        System.out.println("Size: " + set.size());

        // Iterating through the LinkedHashSet
        System.out.println("Iterating through LinkedHashSet:");
        for (Integer num : set) {
            System.out.println(num);
        }

        // Clearing the LinkedHashSet
        set.clear();
        System.out.println("LinkedHashSet after clearing: " + set);
    }
}


Output:

LinkedHashSet Example:

Initial LinkedHashSet: [10, 20, 30, 40]

After removing 30: [10, 20, 40]

Contains 20? true

Size: 3

Iterating through LinkedHashSet:

10

20

40

LinkedHashSet after clearing: []

 

 

(iii) TreeSet

import java.util.TreeSet;

public class TreeSetExample {
    public static void main(String[] args) {
        // Creating a TreeSet
        TreeSet<Integer> treeSet = new TreeSet<>();

        // Adding elements to the TreeSet
        treeSet.add(20);
        treeSet.add(10);
        treeSet.add(40);
        treeSet.add(30);
        treeSet.add(50);

        // Printing the TreeSet (elements are kept sorted)
        System.out.println("TreeSet: " + treeSet);

        // Removing an element
        treeSet.remove(30);
        System.out.println("After removing 30: " + treeSet);

        // Checking if an element exists
        System.out.println("Does TreeSet contain 20? " + treeSet.contains(20));

        // Retrieving first and last elements
        System.out.println("First element: " + treeSet.first());
        System.out.println("Last element: " + treeSet.last());

        // Getting subsets (headSet, tailSet, subSet)
        System.out.println("Elements less than 40: " + treeSet.headSet(40));
        System.out.println("Elements greater than or equal to 20: " + treeSet.tailSet(20));
        System.out.println("Elements between 10 and 40: " + treeSet.subSet(10, 40));

        // Checking size of the TreeSet
        System.out.println("Size of TreeSet: " + treeSet.size());

        // Clearing the TreeSet
        treeSet.clear();
        System.out.println("After clearing, is empty? " + treeSet.isEmpty());
    }
}


Output:

TreeSet: [10, 20, 30, 40, 50]

After removing 30: [10, 20, 40, 50]
Does TreeSet contain 20? true
First element: 10
Last element: 50
Elements less than 40: [10, 20]
Elements greater than or equal to 20: [20, 40, 50]
Elements between 10 and 40: [10, 20]
Size of TreeSet: 4

After clearing, is empty? true

 

 

(iv) SortedSet

import java.util.SortedSet;
import java.util.TreeSet;

public class SortedSetExample {
    public static void main(String[] args) {
        System.out.println("\nSortedSet Example:");
        SortedSet<Integer> set = new TreeSet<>();

        // Adding elements (kept in ascending order)
        set.add(50);
        set.add(10);
        set.add(40);
        set.add(20);
        set.add(30);
        System.out.println("Initial SortedSet: " + set);

        // Removing an element
        set.remove(30);
        System.out.println("After removing 30: " + set);

        // First and last elements, membership and size
        System.out.println("First Element: " + set.first());
        System.out.println("Last Element: " + set.last());
        System.out.println("Contains 20? " + set.contains(20));
        System.out.println("Size: " + set.size());

        // Iterating through the SortedSet
        System.out.println("Iterating through SortedSet:");
        for (Integer num : set) {
            System.out.println(num);
        }

        // Clearing the SortedSet
        set.clear();
        System.out.println("SortedSet after clearing: " + set);
    }
}

Output:

SortedSet Example:

Initial SortedSet: [10, 20, 30, 40, 50]

After removing 30: [10, 20, 40, 50]

First Element: 10

Last Element: 50

Contains 20? true

Size: 4

Iterating through SortedSet:
10

20

40

50

SortedSet after clearing: []


Program 3:

 

Implement the following file management tasks in Hadoop:

·       Adding files and directories

·       Retrieving files

·       Deleting files

 

Create a Directory in HDFS:

hdfs dfs -mkdir /user/gowthu/data

(Creates a directory named data under /user/gowthu/)

Upload (Add) a File to HDFS:

hdfs dfs -put localfile.txt /user/gowthu/data/

(Uploads localfile.txt from the local system to HDFS /user/gowthu/data/)

Copy a File from Local to HDFS:

hdfs dfs -copyFromLocal example.txt /user/gowthu/data/

(Copies example.txt from the local system to /user/gowthu/data/ in HDFS)

List Files in HDFS:

hdfs dfs -ls /user/gowthu/data/

(Lists all files inside /user/gowthu/data/)

Retrieve a File from HDFS to Local:

hdfs dfs -get /user/gowthu/data/example.txt

(Downloads example.txt from HDFS to the current local directory)

Copy a File from HDFS to Local:

hdfs dfs -copyToLocal /user/gowthu/data/example.txt /home/hasan/

(Copies example.txt from HDFS to /home/hasan/ on the local system)

Delete a File in HDFS:

hdfs dfs -rm /user/gowthu/data/in1.txt

(Deletes in1.txt from HDFS)


Delete a Directory in HDFS:

hdfs dfs -rm -r /user/gowthu/data/

(Recursively deletes /user/gowthu/data/ and all its files)
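The same file-management tasks can also be done programmatically through the Hadoop FileSystem API. A minimal sketch, assuming the Hadoop client libraries are on the classpath and using the same illustrative paths as above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsFileManagement {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from the core-site.xml found on the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Adding a directory and a file
        fs.mkdirs(new Path("/user/gowthu/data"));
        fs.copyFromLocalFile(new Path("localfile.txt"), new Path("/user/gowthu/data/"));

        // Listing the directory
        for (FileStatus status : fs.listStatus(new Path("/user/gowthu/data"))) {
            System.out.println(status.getPath());
        }

        // Retrieving a file back to the local file system
        fs.copyToLocalFile(new Path("/user/gowthu/data/localfile.txt"), new Path("."));

        // Deleting a file, then the whole directory (recursive = true)
        fs.delete(new Path("/user/gowthu/data/localfile.txt"), false);
        fs.delete(new Path("/user/gowthu/data"), true);

        fs.close();
    }
}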


Program 4:

Run a basic WordCount MapReduce program to understand the MapReduce paradigm.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every token in the input
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context
                        ) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer (also used as combiner): sums the counts for each word
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context
                           ) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

 

a)   Open Terminal

§  Run the following commands step by step.

b)   Check Current Directory

§  ls

§  pwd

c)   Create an Input File

§  cat > /home/cloudera/processfile1.txt

§  Enter some text:

(Example:
Hadoop is good for Big Data
Hadoop is not for Small Data
It is a Java-based framework)

d) Upload Input File to HDFS

§  hdfs dfs -mkdir /inputfolder1

§  hdfs dfs -put /home/cloudera/processfile1.txt /inputfolder1/

e)   Verify Input File in HDFS

§  hdfs dfs -cat /inputfolder1/processfile1.txt

f)   Run the MapReduce Job

§  hadoop jar /home/cloudera/wordCount.jar WordCount /inputfolder1/processfile1.txt /output1

g) Check Output Directory in HDFS

§  hdfs dfs -ls /output1

h)   View Final Word Count Output

§  hdfs dfs -cat /output1/part-r-00000


i) Cross-check with Original File

 

 

 

§  cat /home/cloudera/processfile1.txt

Output:

 

Big           1
Data          2
Hadoop        2
It            1
Java-based    1
Small         1
a             1
for           2
framework     1
good          1
is            3
not           1
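To see what the mapper and reducer do conceptually, the same counting logic can be sketched as plain Java over the sample text (class name chosen for illustration; a TreeMap is used so the keys come out in the same order as the job's output):

import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

public class LocalWordCount {
    public static void main(String[] args) {
        String input = "Hadoop is good for Big Data Hadoop is not for Small Data It is a Java-based framework";
        Map<String, Integer> counts = new TreeMap<>();
        // "Map" step: tokenize; "Reduce" step: sum the counts per word
        StringTokenizer itr = new StringTokenizer(input);
        while (itr.hasMoreTokens()) {
            counts.merge(itr.nextToken(), 1, Integer::sum);
        }
        counts.forEach((word, count) -> System.out.println(word + "\t" + count));
    }
}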


Program 5:

Run Pig, then write Pig Latin scripts to sort, group, join, project, and filter your data.

$pig

 

grunt> titanic = LOAD 'titanic_sample.csv' USING PigStorage(',')
    AS (PassengerId:int, Survived:int, Pclass:int, Name:chararray, Sex:chararray,
        Age:int, SibSp:int, Parch:int, Ticket:chararray, Fare:float,
        Cabin:chararray, Embarked:chararray);
grunt> DUMP titanic;

output:

(1,0,3,Braund,male,22,1,0,A/5 21171,7.25,,S)
(2,1,1,Cumings,female,38,1,0,PC 17599,71.2833,C85,C)
(3,1,3,Heikkinen,female,26,0,0,STON/O2. 3101282,7.925,,S)
(4,1,1,Futrelle,female,35,1,0,113803,53.1,C123,S)
(5,0,3,Allen,male,35,0,0,373450,8.05,,S)

 

 

Sort Passengers by Age:

grunt> sorted_data = ORDER titanic BY Age ASC;
grunt> DUMP sorted_data;

output:

(1,0,3,Braund,male,22,1,0,A/5 21171,7.25,,S)
(3,1,3,Heikkinen,female,26,0,0,STON/O2. 3101282,7.925,,S)
(4,1,1,Futrelle,female,35,1,0,113803,53.1,C123,S)
(5,0,3,Allen,male,35,0,0,373450,8.05,,S)
(2,1,1,Cumings,female,38,1,0,PC 17599,71.2833,C85,C)


Group Passengers by Survival Status:

grunt> grouped_data = GROUP titanic BY Survived;
grunt> DUMP grouped_data;

output:

(0,{(1,0,3,Braund,male,22,1,0,A/5 21171,7.25,,S),(5,0,3,Allen,male,35,0,0,373450,8.05,,S)})
(1,{(2,1,1,Cumings,female,38,1,0,PC 17599,71.2833,C85,C),(3,1,3,Heikkinen,female,26,0,0,STON/O2. 3101282,7.925,,S),(4,1,1,Futrelle,female,35,1,0,113803,53.1,C123,S)})

 

 

Project (Select) Only Specific Columns:

grunt> projected_data = FOREACH titanic GENERATE PassengerId, Name, Age;
grunt> DUMP projected_data;

output:

(1,Braund,22)
(2,Cumings,38)
(3,Heikkinen,26)
(4,Futrelle,35)
(5,Allen,35)

 

 

Filter Female Passengers Below Age 30:

grunt> filtered_data = FILTER titanic BY Sex == 'female' AND Age < 30;
grunt> DUMP filtered_data;

output:

(3,1,3,Heikkinen,female,26,0,0,STON/O2. 3101282,7.925,,S)

 

 

Join Titanic Data with Ticket Prices:

Sample Ticket Price Dataset (ticket_prices.csv)


Ticket,Price

PC 17599,71.28

113803,53.1

STON/O2. 3101282,7.92

 

 

grunt> ticket_data = LOAD 'ticket_prices.csv' USING PigStorage(',')
    AS (Ticket:chararray, Price:float);
grunt> joined_data = JOIN titanic BY Ticket, ticket_data BY Ticket;
grunt> DUMP joined_data;

output:

(2,1,1,Cumings,female,38,1,0,PC 17599,71.2833,C85,C,PC 17599,71.28)
(3,1,3,Heikkinen,female,26,0,0,STON/O2. 3101282,7.925,,S,STON/O2. 3101282,7.92)
(4,1,1,Futrelle,female,35,1,0,113803,53.1,C123,S,113803,53.1)

 

Clean Data:

grunt> cleaned_data = FILTER titanic BY Age IS NOT NULL AND Fare IS NOT NULL;

grunt> cleaned_data = FOREACH cleaned_data GENERATE PassengerId, Survived, Pclass, LOWER(Sex) AS Sex, Age, Fare, Embarked;

grunt> DUMP cleaned_data;

output:

(1,0,3,male,22,7.25,S)

(2,1,1,female,38,71.2833,C)

(3,1,3,female,26,7.925,S)

(4,1,1,female,35,53.1,S)

(5,0,3,male,35,8.05,S)


Normalize Fare (Scale between 0-1):

grunt> fare_stats = FOREACH (GROUP cleaned_data ALL) GENERATE
           MIN(cleaned_data.Fare) AS min_fare,
           MAX(cleaned_data.Fare) AS max_fare;
grunt> normalized_data = FOREACH cleaned_data GENERATE
           PassengerId, Survived, Pclass, Sex, Age,
           (Fare - fare_stats.min_fare) / (fare_stats.max_fare - fare_stats.min_fare) AS NormalizedFare,
           Embarked;
grunt> DUMP normalized_data;

output:

(1,0,3,male,22,0.0,S)

(2,1,1,female,38,1.0,C)

(3,1,3,female,26,0.0112,S)

(4,1,1,female,35,0.774,S)

(5,0,3,male,35,0.0133,S)

 

 

Store the data into a new file:

grunt> STORE normalized_data INTO 'output/normalized_titanic' USING PigStorage(',');
grunt> exit;

$ hdfs dfs -cat output/normalized_titanic/part*

output:

1,0,3,male,22,0.0,S

2,1,1,female,38,1.0,C

3,1,3,female,26,0.0112,S

4,1,1,female,35,0.774,S

5,0,3,male,35,0.0133,S


Program 6:

Run Hive, then use Hive to create, alter, and drop databases, tables, views, functions, and indexes.

1  Create database:

hive> create database csea;
hive> create database cseb;

2  Show database:

hive> show databases ;

Output:

OK

csea
cseb
default

Time taken: 0.985 seconds, Fetched: 3 row(s)

3  Use database:

hive> use csea;

4  Alter database:

hive> alter database csea set DBPROPERTIES ('creator'='abc');

Output:

OK

Time taken: 0.196 seconds

5  Drop database:

hive> DROP DATABASE csea;

Output:

OK

Time taken: 0.196 seconds

6  Create Index:


CREATE TABLE orders (
    order_id INT,
    customer_id INT,
    product STRING,
    category STRING,
    price DOUBLE,
    order_date STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

 

CREATE INDEX category_index ON TABLE orders (category)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;

 

Output:

OK

Time taken: 0.167 seconds

 

 

7  Altering Index:

ALTER INDEX category_index ON orders REBUILD;

Output:

OK

Time taken: 0.167 seconds


8  Drop Index:

DROP INDEX category_index ON orders;

Output:

OK

Time taken: 0.161 seconds

 

 

9  Create table:

hive> use csea;

hive(csea)> create table student(sno int, sna string)
          > row format delimited
          > fields terminated by '\t'
          > stored as textfile;

Output:

OK

Time taken: 0.343 seconds

 

 

10  Altering a table:

hive(csea)> ALTER TABLE student CHANGE COLUMN sno redg_no INT;

Output:

OK

Time taken: 0.042 seconds

 

 

11  Drop Table:

hive(csea)> DROP table student;

Output:

OK


Time taken: 0.432 seconds

 

 

12  Create view:

hive> CREATE VIEW 2012_emp_view (empno,empname,Joining_yr) AS

>  SELECT eno,ena,year FROM employee WHERE year=2012;

Output:

OK

Time taken: 0.079 seconds

 

 

13  Alter view:

hive> ALTER VIEW 2012_emp_view AS

>  SELECT eno,year FROM employee WHERE year=2012;

Output:

OK

Time taken: 0.117 seconds

 

 

14  Drop View:

hive> DROP VIEW 2012_emp_view;

Output:

OK

Time taken: 0.808 seconds

15  Create function:

hive> CREATE TEMPORARY FUNCTION abc AS 'com.example.hive.udf.PrimeCheckUDF';

Output:

OK

Time taken: 0.908 seconds


16  Altering function:

Hive has no ALTER FUNCTION statement; to point a function at a new JAR, drop it and re-create it:

hive> DROP TEMPORARY FUNCTION abc;
hive> CREATE TEMPORARY FUNCTION abc AS 'com.example.hive.udf.PrimeCheckUDF'
    > USING JAR '/new/path/to/updated_prime_check_udf.jar';

Output:

OK

Time taken: 0.704 seconds

17  Drop function:

hive> drop FUNCTION abc;

Output:

OK

Time taken: 0.808 seconds

 

 

CTAS in Hive (Create Table As Select):

CREATE TABLE high_salary_employees AS
SELECT emp_id, emp_name, salary
FROM employee
WHERE salary > 50000;

Create partitioned table:

CREATE TABLE sales_partitioned (
    sale_id INT,
    product_id INT,
    amount FLOAT
)
PARTITIONED BY (sale_date STRING)
STORED AS PARQUET;


Output:

OK

Time taken: 0.135 seconds

Creating a Bucketed Table

CREATE TABLE customers_bucketed (
    customer_id INT,
    name STRING,
    email STRING
)
CLUSTERED BY (customer_id) INTO 4 BUCKETS
STORED AS ORC;

Output:

OK

Time taken: 0.197 seconds

Joins:

Step 1: Create Patients Table

hive> CREATE TABLE patients (
    >     patient_id INT,
    >     name STRING,
    >     age INT
    > )
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY ','
    > STORED AS TEXTFILE;
OK
Time taken: 4.409 seconds


Step 2: Create Diagnosis Table

hive> CREATE TABLE diagnosis (
    >     diagnosis_id INT,
    >     patient_id INT,
    >     disease STRING
    > )
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY ','
    > STORED AS TEXTFILE;
OK
Time taken: 0.111 seconds

Step 3: Prepare Input Data Files

[cloudera@quickstart ~]$ cat > patients.txt
1,john,45
2,bipin,44
3,rahul,23
4,neer,45
5,ram,24^Z
[1]+ Stopped                 cat > patients.txt
[cloudera@quickstart ~]$ cat patients.txt
1,john,45
2,bipin,44
3,rahul,23
4,neer,45

 

Step 4:

[cloudera@quickstart ~]$ cat > diagnosis1.txt
101,1,diabtes
103,2,hypertension
104,4,sugar
105,3,BP
^Z
[3]+ Stopped                 cat > diagnosis1.txt
[cloudera@quickstart ~]$ cat diagnosis1.txt
101,1,diabtes
103,2,hypertension
104,4,sugar
105,3,BP

Step 5: Upload Data to HDFS

1. Create a directory

[cloudera@quickstart ~]$ hdfs dfs -mkdir -p /user/hive/warehouse/patients_data

2. Put the files into the directory

[cloudera@quickstart ~]$ hdfs dfs -put /home/cloudera/patients.txt /user/hive/warehouse/patients_data
[cloudera@quickstart ~]$ hdfs dfs -put /home/cloudera/diagnosis1.txt /user/hive/warehouse/patients_data

3. Verify the files were loaded into HDFS

[cloudera@quickstart ~]$ hdfs dfs -ls /user/hive/warehouse/patients_data/
Found 2 items
-rw-r--r-- 1 cloudera supergroup          54 2025-03-21 05:47 /user/hive/warehouse/patients_data/diagnosis1.txt
-rw-r--r-- 1 cloudera supergroup          42 2025-03-21 05:46 /user/hive/warehouse/patients_data/patients.txt

Step 6: Loading data from files into tables

hive> LOAD DATA INPATH '/user/hive/warehouse/patients_data/patients.txt'
    > INTO TABLE patients;


Loading data to table default.patients

Table default.patients stats: [numFiles=1, totalSize=42]
OK

Time taken: 0.977 seconds

 

 

hive> LOAD DATA INPATH '/user/hive/warehouse/patients_data/diagnosis1.txt'

>  INTO TABLE diagnosis;

Loading data to table default.diagnosis

Table default.diagnosis stats: [numFiles=1, totalSize=54]
OK
Time taken: 0.38 seconds

Step 7: Check that the data loaded

hive> select * from patients;

Output:

OK

 

1    john     45
2    bipin    44
3    rahul    23
4    neer     45
Time taken: 0.44 seconds, Fetched: 4 row(s)

hive> select * from diagnosis;

Output:

OK

 

101    1    diabtes
103    2    hypertension
104    4    sugar
105    3    BP
Time taken: 0.071 seconds, Fetched: 4 row(s)

Step 8: Run the queries

Inner Join (Only Matching Records)

hive> SELECT p.patient_id, p.name, p.age, d.disease

>  FROM patients p

>  JOIN diagnosis d

>  ON p.patient_id = d.patient_id;

Output:

Query ID = cloudera_20250321055050_d4edf790-2711-4070-8e25-2ea1285b59cd
Total jobs = 1
Execution completed successfully
MapredLocal task succeeded

OK

 

1    john     45    diabtes
2    bipin    44    hypertension
4    neer     45    sugar
3    rahul    23    BP

Time taken: 42.857 seconds, Fetched: 4 row(s)

Left Join (All Patients, Even Without Diagnosis)

hive> SELECT p.patient_id, p.name, p.age, d.disease

>  FROM patients p

>  LEFT JOIN diagnosis d

>  ON p.patient_id = d.patient_id;

Output:

Query ID = cloudera_20250321055252_3858002d-eae4-406a-9b01-87d332efabc2
Total jobs = 1
Execution completed successfully

Execution completed successfully


MapredLocal task succeeded
OK

1    john     45    diabtes
2    bipin    44    hypertension
3    rahul    23    BP
4    neer     45    sugar
Time taken: 36.066 seconds, Fetched: 4 row(s)

Right Join (All Diagnoses, Even Without Patient)

hive> SELECT p.patient_id, p.name, p.age, d.disease

>  FROM patients p

>  RIGHT JOIN diagnosis d

>  ON p.patient_id = d.patient_id;

Output:

Query ID = cloudera_20250321055454_b59fbe32-c0b0-4b52-b2ae-199db6988f36
Total jobs = 1
MapReduce Jobs Launched:
Stage-Stage-3: Map: 1  Cumulative CPU: 1.77 sec  HDFS Read: 6561  HDFS Write: 72  SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 770 msec
OK

1    john     45    diabtes
2    bipin    44    hypertension
4    neer     45    sugar
3    rahul    23    BP

Time taken: 31.549 seconds, Fetched: 4 row(s)

Full Outer Join (All Records, Filling Missing Values)

hive> SELECT p.patient_id, p.name, p.age, d.disease


>  FROM patients p

>  FULL OUTER JOIN diagnosis d

>  ON p.patient_id = d.patient_id;

Output:

Query ID = cloudera_20250321055555_9279366d-fae8-425a-997c-725517746534

Total jobs = 1

MapReduce Jobs Launched:

Stage-Stage-1: Map: 2  Reduce: 1  Cumulative CPU: 5.34 sec  HDFS Read: 13287  HDFS Write: 72  SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 340 msec
OK

1    john     45    diabtes
2    bipin    44    hypertension
3    rahul    23    BP
4    neer     45    sugar

Time taken: 48.205 seconds, Fetched: 4 row(s)
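The same queries can also be issued from Java through the HiveServer2 JDBC driver. A minimal sketch, assuming HiveServer2 is running on localhost:10000, the hive-jdbc driver is on the classpath, and the user name shown is illustrative:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // Register the HiveServer2 JDBC driver and connect to the default database
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection con = DriverManager.getConnection(url, "cloudera", "");
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT p.patient_id, p.name, d.disease " +
                     "FROM patients p JOIN diagnosis d ON p.patient_id = d.patient_id")) {
            while (rs.next()) {
                System.out.println(rs.getInt(1) + "\t" + rs.getString(2) + "\t" + rs.getString(3));
            }
        }
    }
}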

