
Unveiling Insights from IPL Ticket Booking Data with Apache Spark

Apr 8, 2024

Introduction:

In the realm of sports, the Indian Premier League (IPL) stands out as one of the most electrifying events, capturing the attention of millions of cricket enthusiasts worldwide.

Beyond the thrill of the matches, understanding the dynamics of ticket bookings provides valuable insights into fan engagement and team popularity. In this blog post, we’ll delve into analyzing IPL ticket booking data using Apache Spark, a powerful tool for processing large-scale datasets.

Understanding the Data:

The dataset we’re working with comprises records of ticket bookings for various IPL teams across different stadiums. Each record includes the team name, stadium ID, the user’s login status, and a unique user ID. Our goal is to extract meaningful insights from this data, including the number of seats booked for each team and the stadiums with the most bookings for each team. A sample record is shown below.
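
For reference, a single booking event in the sample dataset used later in this post looks like this (field meanings are annotated; as the pipeline below shows, only bookings from logged-in users are counted as seats):

# One booking event from the sample dataset
sample_event = {
    "team_name": "Mumbai Indians",   # IPL team the ticket is for
    "stadium_id": 1,                 # identifier of the venue
    "is_user_logged_in": True,       # only logged-in users are counted as booked seats
    "unique_user_id": "user1",       # identifier of the booking user
}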

Processing the Data with Apache Spark:

We process the booking data using Apache Spark’s RDD transformations; the example in this post runs on a static batch of events, though the same pipeline could be fed by a stream (see the sketch near the end). The processing pipeline involves several key steps, and a short trace of the intermediate record shapes follows the list:

  1. Mapping Events: Each event (ticket booking) is mapped to a tuple containing the team name, stadium ID, and a count based on the user’s login status.
  2. Reducing by Key: Events are then reduced by key (team name, stadium ID) to sum up the counts, yielding the total number of seats booked for each team at each stadium.
  3. Restructuring Data: The data is restructured to have the team name as the key and a tuple containing the stadium ID and seats booked as the value.
  4. Identifying Maximum Bookings: Stadiums with the maximum seats booked for each team are determined by reducing the data by the team name and selecting the stadium with the highest bookings.
  5. Formatting Output: Finally, the processed data is formatted into a dictionary format for clear presentation of the results.
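
To make these steps concrete, here is a minimal plain-Python sketch of the tuple shapes the pipeline produces, using two hypothetical Mumbai Indians bookings. It mirrors the Spark lambdas used later but runs without a cluster:

from functools import reduce

# Two hypothetical bookings: a logged-in user at stadium 1,
# and a not-logged-in user at stadium 2
events = [
    {"team_name": "Mumbai Indians", "stadium_id": 1, "is_user_logged_in": True},
    {"team_name": "Mumbai Indians", "stadium_id": 2, "is_user_logged_in": False},
]

# Step 1 (map):  [(('Mumbai Indians', 1), 1), (('Mumbai Indians', 2), 0)]
step1 = [((e["team_name"], e["stadium_id"]), 1 if e["is_user_logged_in"] else 0)
         for e in events]

# Step 2 (reduceByKey): sums counts per (team, stadium); the keys here are already unique
# Step 3 (map):  [('Mumbai Indians', (1, 1)), ('Mumbai Indians', (2, 0))]
step3 = [(team_stadium[0], (team_stadium[1], count)) for team_stadium, count in step1]

# Step 4 (reduceByKey): the same comparison the Spark lambda makes on the values
best = reduce(lambda a, b: a if a[1] > b[1] else b, [value for _, value in step3])

# Step 5 (map):  {'team_name': 'Mumbai Indians', 'stadium_id': 1, 'seats_booked': 1}
print({"team_name": "Mumbai Indians", "stadium_id": best[0], "seats_booked": best[1]})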

Insights Revealed:

After processing the data, we uncover intriguing insights into IPL ticket bookings:

  • Mumbai Indians lead in bookings, with 3 seats booked at Stadium 1.
  • Delhi Capitals secure 1 seat booked at Stadium 3.
  • Chennai Super Kings and Royal Challengers Bangalore each have 2 seats booked, with Stadiums 4 and 5 being their top venues, respectively.

These findings shed light on the popularity of IPL teams and the preferences of fans in terms of stadium choices.

from pyspark import SparkContext

# Initialize SparkContext (local mode, two worker threads)
sc = SparkContext("local[2]", "IPLTicketBookingAnalysis")

# Sample booking events (one dictionary per booking)
inputStream = [
    {"team_name": "Mumbai Indians", "stadium_id": 1, "is_user_logged_in": True, "unique_user_id": "user1"},
    {"team_name": "Mumbai Indians", "stadium_id": 1, "is_user_logged_in": True, "unique_user_id": "user2"},
    {"team_name": "Mumbai Indians", "stadium_id": 1, "is_user_logged_in": True, "unique_user_id": "user1"},
    {"team_name": "Mumbai Indians", "stadium_id": 2, "is_user_logged_in": False, "unique_user_id": "user3"},
    {"team_name": "Mumbai Indians", "stadium_id": 2, "is_user_logged_in": True, "unique_user_id": "user7"},
    {"team_name": "Delhi Capitals", "stadium_id": 3, "is_user_logged_in": True, "unique_user_id": "user1"},
    {"team_name": "Chennai Super Kings", "stadium_id": 4, "is_user_logged_in": True, "unique_user_id": "user4"},
    {"team_name": "Chennai Super Kings", "stadium_id": 4, "is_user_logged_in": True, "unique_user_id": "user5"},
    {"team_name": "Royal Challengers Bangalore", "stadium_id": 5, "is_user_logged_in": True, "unique_user_id": "user6"},
    {"team_name": "Royal Challengers Bangalore", "stadium_id": 5, "is_user_logged_in": True, "unique_user_id": "user8"},
]

# Analyze IPL ticket bookings for a given RDD of booking events
def process(events):
    return (
        # Step 1: Map each event to ((team name, stadium ID), count),
        # counting a seat only when the user is logged in
        events.map(lambda event: ((event["team_name"], event["stadium_id"]),
                                  1 if event["is_user_logged_in"] else 0))

        # Step 2: Reduce by key (team name, stadium ID), summing counts
        .reduceByKey(lambda a, b: a + b)

        # Step 3: Restructure to team name as key and (stadium ID, seats booked) as value
        .map(lambda x: (x[0][0], (x[0][1], x[1])))

        # Step 4: Reduce by key (team name), keeping the stadium with the most seats booked
        .reduceByKey(lambda a, b: a if a[1] > b[1] else b)

        # Step 5: Map to dictionary format for the final output
        .map(lambda x: {"team_name": x[0], "stadium_id": x[1][0], "seats_booked": x[1][1]})
    )

# Turn the sample events into an RDD and run the pipeline
result = process(sc.parallelize(inputStream))

# Display the results
for r in result.collect():
    print(r)

# Output:
{'team_name': 'Mumbai Indians', 'stadium_id': 1, 'seats_booked': 3}
{'team_name': 'Delhi Capitals', 'stadium_id': 3, 'seats_booked': 1}
{'team_name': 'Chennai Super Kings', 'stadium_id': 4, 'seats_booked': 2}
{'team_name': 'Royal Challengers Bangalore', 'stadium_id': 5, 'seats_booked': 2}
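
The pipeline above runs over a static batch for clarity. If bookings arrived as a live stream, the same process function could be reused per micro-batch. A minimal sketch using the legacy DStream API (pyspark.streaming, deprecated in recent Spark releases) and a queue-backed stream built from the same sample events might look like this:

from pyspark.streaming import StreamingContext

# Reuse sc, inputStream, and process() from above
ssc = StreamingContext(sc, 5)  # 5-second micro-batches

# A queue-backed DStream that replays the sample events as one micro-batch
bookings = ssc.queueStream([sc.parallelize(inputStream)])

# Apply the same pipeline to each micro-batch and print the results
bookings.foreachRDD(lambda rdd: [print(r) for r in process(rdd).collect()])

ssc.start()
ssc.awaitTerminationOrTimeout(15)   # let the demo batch run, then shut down
ssc.stop(stopSparkContext=False)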

Conclusion:

Analyzing IPL ticket booking data using Apache Spark enables us to glean valuable insights into fan behavior and team engagement. By leveraging the powerful capabilities of Spark for data processing, we uncover patterns that contribute to a deeper understanding of the dynamics within the IPL ecosystem. As the tournament continues to captivate audiences, such data-driven analyses offer valuable perspectives for teams, organizers, and enthusiasts alike.

Written by Kuldeep Pal

Data Engineer | Software Engineer | Data Platform | Big Data | Machine learning | https://www.linkedin.com/in/kuldeep27396/
