IBM Capstone Project
SpaceX - Data Collection via SpaceX API
This is the capstone project required to earn the IBM Data Science Professional Certificate. The project was directed by Yan Luo, a data scientist and developer, and Joseph Santarcangelo, both data scientists at IBM. It is presented in seven sections, and the contents were compiled from the lecture Jupyter notebooks and tutorials.
As a data scientist, I was tasked with predicting whether the first stage of the SpaceX Falcon 9 rocket will land successfully, so that a rival firm can submit better-informed bids against SpaceX for a rocket launch. On its website, SpaceX advertises Falcon 9 rocket launches for 62 million dollars, whereas other companies charge upwards of 165 million dollars. A significant portion of the savings is attributable to SpaceX's ability to reuse the first stage. If we can determine whether the first stage will land, we can therefore estimate the launch cost. This information is useful to any company that wants to compete with SpaceX for a rocket launch. In this project, I apply the data science methodology, including business understanding, data collection, data wrangling, exploratory data analysis, data visualization, model development, model evaluation, and stakeholder reporting.
The first step is to send a GET request to the SpaceX API. I will also perform some basic data wrangling and formatting. The following libraries are imported into the Jupyter notebook.
- Requests enables HTTP requests, which will be used to retrieve data from an API.
- Pandas is a data manipulation and analysis package for the Python programming language.
- NumPy is a Python library that supports large, multidimensional arrays and matrices, along with a large collection of high-level mathematical functions that operate on these arrays.
- Datetime is a standard-library module for representing dates and times.
In addition, we use pd.set_option() so that all columns of the dataframe, and the full content of each cell, are displayed.
import requests
import pandas as pd
import numpy as np
import datetime
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
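The two calls above change the display settings for the whole session. As a minimal sketch of an alternative, pd.option_context() applies the same options only inside a with block; the variable names here are illustrative and not part of the original notebook.

```python
import pandas as pd

# Remember the current setting so we can confirm it is restored afterwards.
default_max_columns = pd.get_option('display.max_columns')

# Inside the block, all columns and full cell contents are displayed.
with pd.option_context('display.max_columns', None,
                       'display.max_colwidth', None):
    inside = pd.get_option('display.max_columns')

# Outside the block, the previous setting is back in effect.
after = pd.get_option('display.max_columns')
print(inside, after == default_max_columns)
```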
Following that, we will develop a series of helper functions that will enable us to use the API to retrieve information from the launch data using identification numbers.
We would like to learn the booster's name from the rocket column. The function below uses each ID in the rocket column to call the API and appends the returned name to the BoosterVersion list.
def getBoosterVersion(data):
    # Look up each rocket ID and record the booster name.
    for x in data['rocket']:
        if x:
            response = requests.get("https://api.spacexdata.com/v4/rockets/"+str(x)).json()
            BoosterVersion.append(response['name'])
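Because many launches share the same rocket ID, the helper above repeats identical requests. The following is a minimal sketch of one way to fetch each ID only once. It assumes a hypothetical fetch_json callable (not part of the original notebook) that takes a URL and returns the decoded JSON, and it is demonstrated offline with made-up IDs.

```python
def booster_names(rocket_ids, fetch_json):
    """Return one booster name per non-empty rocket ID, fetching each ID once.

    `fetch_json` is a hypothetical callable (URL -> decoded JSON dict),
    for example: lambda url: requests.get(url).json()
    """
    base_url = "https://api.spacexdata.com/v4/rockets/"
    cache = {}   # rocket ID -> booster name
    names = []
    for rocket_id in rocket_ids:
        if not rocket_id:
            continue  # mirror the original helper: skip empty IDs
        if rocket_id not in cache:
            cache[rocket_id] = fetch_json(base_url + str(rocket_id))["name"]
        names.append(cache[rocket_id])
    return names


# Offline demonstration with a stand-in for the API (IDs are made up).
calls = []

def fake_fetch(url):
    calls.append(url)
    return {"name": "Falcon 9" if url.endswith("f9") else "Falcon 1"}

demo_names = booster_names(["f1", "f9", "f9", None, "f9"], fake_fetch)
print(demo_names)   # one name per non-empty ID
print(len(calls))   # number of simulated requests
```

The same caching idea would apply to the launchpad, payload, and core lookups.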
From the launchpad, we would like to know the name of each launch site, as well as its longitude and latitude. The function below uses each ID in the launchpad column to call the API and appends the results to the corresponding lists.
def getLaunchSite(data):
    # Look up each launchpad ID and record its coordinates and name.
    for x in data['launchpad']:
        if x:
            response = requests.get("https://api.spacexdata.com/v4/launchpads/"+str(x)).json()
            Longitude.append(response['longitude'])
            Latitude.append(response['latitude'])
            LaunchSite.append(response['name'])
From the payload, we would like to determine the payload's mass and its target orbit. The function uses the payloads column of the dataset to call the API and appends the data to the lists.
def getPayloadData(data):
    # Look up each payload ID and record its mass (kg) and orbit.
    for load in data['payloads']:
        if load:
            response = requests.get("https://api.spacexdata.com/v4/payloads/"+load).json()
            PayloadMass.append(response['mass_kg'])
            Orbit.append(response['orbit'])
From the cores, we are interested in the landing outcome, the landing type, the number of flights with that core, whether gridfins were used, whether the core was reused, whether legs were used, the landing pad used, the core's block (a number used to distinguish versions of cores), the number of times this specific core has been reused, and the core's serial number. The function uses the cores column of the dataset to call the API and appends the data to the lists.
def getCoreData(data):
    for core in data['cores']:
        # Block, reuse count, and serial come from the API when a core ID exists.
        if core['core'] is not None:
            response = requests.get("https://api.spacexdata.com/v4/cores/"+core['core']).json()
            Block.append(response['block'])
            ReusedCount.append(response['reuse_count'])
            Serial.append(response['serial'])
        else:
            Block.append(None)
            ReusedCount.append(None)
            Serial.append(None)
        # The remaining fields are already present in the launch record.
        Outcome.append(str(core['landing_success'])+' '+str(core['landing_type']))
        Flights.append(core['flight'])
        GridFins.append(core['gridfins'])
        Reused.append(core['reused'])
        Legs.append(core['legs'])
        LandingPad.append(core['landpad'])
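To make the Outcome value concrete, here is a small sketch of how the string is assembled from a core record. The helper name outcome_label and the two sample records are made up for illustration.

```python
def outcome_label(core):
    """Combine landing success and landing type the way getCoreData does."""
    return str(core['landing_success']) + ' ' + str(core['landing_type'])


# Made-up core records for illustration only.
landed = {'landing_success': True, 'landing_type': 'ASDS'}
no_attempt = {'landing_success': None, 'landing_type': None}

print(outcome_label(landed))      # True ASDS
print(outcome_label(no_attempt))  # None None
```

Because str() is applied to both fields, a missing value appears as the text "None" in the label rather than as a null.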
Now let's request the rocket launch data from the SpaceX API using the URL below and examine the content of the response. We expect the response to contain a large amount of data about SpaceX launches.
spacex_url="https://api.spacexdata.com/v4/launches/past"
response = requests.get(spacex_url)
print(response.content)
For this project, we will use the following static response object to keep the JSON results consistent. The request should succeed with status code 200.
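static_json_url='https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DS0321EN-SkillsNetwork/datasets/API_call_spacex_api.json'
# Request the static copy so that the later steps work on it.
response = requests.get(static_json_url)
response.status_code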
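Instead of inspecting the status code by eye, the check can be made part of the request. The sketch below assumes a hypothetical wrapper named fetch_json (not in the original notebook). It relies on the timeout argument of requests.get and on the raise_for_status() and json() methods of the response, and it is demonstrated offline with a stand-in object.

```python
def fetch_json(url, get, timeout=30):
    """Request `url` with `get` (e.g. requests.get) and return decoded JSON.

    Raises on a 4xx/5xx status instead of passing an error payload
    on to the later steps.
    """
    response = get(url, timeout=timeout)
    response.raise_for_status()
    return response.json()


# Offline stand-in that mimics the parts of requests.Response used above.
class FakeResponse:
    status_code = 200

    def raise_for_status(self):
        if self.status_code >= 400:
            raise RuntimeError(f"HTTP {self.status_code}")

    def json(self):
        return [{"flight_number": 1}]


def fake_get(url, timeout=None):
    return FakeResponse()


launches = fetch_json("https://example.invalid/launches", fake_get)
print(launches)
```

With the real library, the call would be fetch_json(static_json_url, requests.get).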
Now we decode the response content as JSON with .json() and convert it to a Pandas dataframe with the pd.json_normalize() function.
data = pd.json_normalize(response.json())
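As a small illustration of what pd.json_normalize() does, the sketch below flattens one made-up record shaped loosely like a launch entry. The field values are placeholders rather than real API output.

```python
import pandas as pd

# Made-up record; the IDs and file name are placeholders.
records = [
    {
        "flight_number": 1,
        "rocket": "rocket-id-1",
        "links": {"patch": {"small": "small.png"}},
        "cores": [{"core": "core-id-1", "flight": 1}],
    }
]

flat = pd.json_normalize(records)

# Nested dicts become dotted column names; lists stay as single cells.
print(sorted(flat.columns))
print(type(flat.loc[0, "cores"]))
```

This is why the cores and payloads columns still hold lists after normalization and need to be unwrapped separately.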
Most of the data in the collection consists of IDs. For example, the rocket column contains only a unique identifier and no further information. We will therefore use the API again to retrieve details about each launch from these IDs, specifically from the rocket, payloads, launchpad, and cores columns. First, we create a subset of the dataframe that keeps only the features we need, along with flight_number and date_utc. Second, we remove rows with multiple cores, which correspond to Falcon rockets with two additional boosters, as well as rows with multiple payloads on a single rocket. Since payloads and cores are then lists of size 1, we extract the single value from each list and replace the feature with it. We also convert date_utc to a datetime datatype and keep only the date, dropping the time. Finally, we use the date to restrict the range of launches.
data = data[['rocket', 'payloads', 'launchpad', 'cores', 'flight_number', 'date_utc']]
data = data[data['cores'].map(len)==1]
data = data[data['payloads'].map(len)==1]
data['cores'] = data['cores'].map(lambda x : x[0])
data['payloads'] = data['payloads'].map(lambda x : x[0])
data['date'] = pd.to_datetime(data['date_utc']).dt.date
data = data[data['date'] <= datetime.date(2020, 11, 13)]
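The effect of these filters is easier to see on a toy dataframe. The sketch below applies the same steps to three made-up rows: one with two cores, one after the cutoff date, and one that passes both filters.

```python
import datetime
import pandas as pd

# Made-up rows: IDs and dates are placeholders, not real API output.
toy = pd.DataFrame({
    "cores": [[{"core": "a"}], [{"core": "b"}, {"core": "c"}], [{"core": "d"}]],
    "payloads": [["p1"], ["p2"], ["p3"]],
    "date_utc": ["2010-06-04T18:45:00.000Z",
                 "2018-02-06T20:45:00.000Z",
                 "2021-01-08T02:15:00.000Z"],
})

# Keep single-core, single-payload launches, then unwrap the lists.
toy = toy[toy["cores"].map(len) == 1]
toy = toy[toy["payloads"].map(len) == 1].copy()
toy["cores"] = toy["cores"].map(lambda x: x[0])
toy["payloads"] = toy["payloads"].map(lambda x: x[0])

# Convert the timestamp to a plain date and apply the cutoff.
toy["date"] = pd.to_datetime(toy["date_utc"]).dt.date
toy = toy[toy["date"] <= datetime.date(2020, 11, 13)]

print(toy[["payloads", "date"]])
```

Only the first row remains: the second is dropped for having two cores and the third for falling after the cutoff.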
To restate our aims once again:
- We wish to identify the booster's name from the rocket column.
- From the payload column, we would like to determine the payload's mass and its intended orbit.
- We would like to know the name of the launch site, its longitude, and its latitude from the launchpad column.
- From the cores column, we would like to learn the landing outcome, the type of landing, the number of flights with that core, whether gridfins were used, whether the core is reused, whether legs were used, the landing pad used, the block of the core, which is a number used to distinguish versions of cores, the number of times this specific core has been reused, and the serial number of the core.
The data from these requests will be stored in the following lists and used to build a new dataframe.
BoosterVersion = []
PayloadMass = []
Orbit = []
LaunchSite = []
Outcome = []
Flights = []
GridFins = []
Reused = []
Legs = []
LandingPad = []
Block = []
ReusedCount = []
Serial = []
Longitude = []
Latitude = []
Next, we call the functions defined earlier, each of which appends its results to the lists above.
getBoosterVersion(data)
getLaunchSite(data)
getPayloadData(data)
getCoreData(data)
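Some helpers skip rows whose ID is empty, so a list can end up shorter than the dataframe, and pd.DataFrame raises a ValueError when a dictionary of lists has unequal lengths. As a minimal sketch, a check like the following could be run before building the dataframe. The helper name check_lengths and the sample lists are illustrative assumptions.

```python
def check_lengths(columns, expected):
    """Return the names of lists whose length differs from `expected`."""
    return [name for name, values in columns.items() if len(values) != expected]


# Made-up lists standing in for the globals filled by the helpers.
demo_columns = {
    "BoosterVersion": ["Falcon 1", "Falcon 9"],
    "PayloadMass": [20.0, None],
    "Orbit": ["LEO"],          # one entry short on purpose
}

mismatched = check_lengths(demo_columns, expected=2)
print(mismatched)   # ['Orbit']
```

In the notebook, expected would be len(data) and columns would map each name to its list.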
Now that we have collected enough information, let's use it to build our dataset. We combine the columns into a dictionary, launch_dict, and then create a Pandas dataframe from it.
launch_dict = {'FlightNumber': list(data['flight_number']),
'Date': list(data['date']),
'BoosterVersion':BoosterVersion,
'PayloadMass':PayloadMass,
'Orbit':Orbit,
'LaunchSite':LaunchSite,
'Outcome':Outcome,
'Flights':Flights,
'GridFins':GridFins,
'Reused':Reused,
'Legs':Legs,
'LandingPad':LandingPad,
'Block':Block,
'ReusedCount':ReusedCount,
'Serial':Serial,
'Longitude': Longitude,
'Latitude': Latitude}
launch_data=pd.DataFrame(launch_dict)
launch_data.head()
As the first five rows of the resulting dataframe show, it still includes launches we do not need. We will retain only the Falcon 9 launches and eliminate the Falcon 1 launches. The code below filters the dataframe on the BoosterVersion column, stores the result in a new dataframe called data_falcon9, and resets the FlightNumber column, since removing rows leaves gaps in its values.
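# .copy() gives an independent dataframe, so later assignments do not target a slice of launch_data.
data_falcon9=launch_data[launch_data["BoosterVersion"]!="Falcon 1"].copy()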
data_falcon9.loc[:,'FlightNumber'] = list(range(1, data_falcon9.shape[0]+1))
data_falcon9
The final step is to check for missing values. We use the .isnull() method to flag missing values and the .sum() method to count the missing records in each column.
data_falcon9.isnull().sum()
In our dataset, we can see that a few rows have missing values, which we must address before proceeding. The LandingPad column will keep its None values, which indicate that no landing pad was used. For PayloadMass, we calculate the mean with the .mean() method and then use the .replace() method to replace the np.nan values with that mean.
mean_PayloadMass=data_falcon9["PayloadMass"].mean()
data_falcon9["PayloadMass"] = data_falcon9["PayloadMass"].replace(np.nan, mean_PayloadMass)
data_falcon9.isnull().sum()
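The imputation step can be checked on a toy column. The sketch below uses made-up masses and fillna(), which has the same effect here as replace(np.nan, ...).

```python
import numpy as np
import pandas as pd

# Made-up payload masses in kilograms; one value is missing.
toy = pd.DataFrame({"PayloadMass": [1000.0, np.nan, 3000.0]})

# Series.mean() skips NaN by default, so only the two known values count.
mean_mass = toy["PayloadMass"].mean()

# Fill the gap with the mean of the known values.
toy["PayloadMass"] = toy["PayloadMass"].fillna(mean_mass)

print(mean_mass)                            # 2000.0
print(toy["PayloadMass"].isnull().sum())    # 0
```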
The number of missing values in PayloadMass should now be zero, and no column other than LandingPad should have missing values. We can now export the Falcon 9 dataframe as a CSV file for the next section.
data_falcon9.to_csv('dataset_part_1.csv', index=False)
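One detail worth knowing for the next section is how the None values in LandingPad survive the export. The sketch below writes a made-up dataframe to an in-memory buffer instead of a file and reads it back; the column values are placeholders.

```python
import io
import pandas as pd

toy = pd.DataFrame({
    "FlightNumber": [1, 2],
    "LandingPad": [None, "pad-id-1"],   # placeholder ID
})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)   # index=False omits the row index column
buffer.seek(0)
restored = pd.read_csv(buffer)

print(list(restored.columns))
print(restored["LandingPad"].isnull().sum())
```

The missing landing pad is written as an empty field and comes back as NaN, so .isnull() still identifies it after reloading.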