IBM Capstone Project
SpaceX - Exploratory Data Analysis Using SQL
This is the capstone project required to get the IBM Data Science Professional Certificate. Yan Luo, a data scientist and developer, and Joseph Santarcangelo, both data scientists at IBM, directed the project. The project will be presented in seven sections, and the lecture Jupyter notebooks and tutorials were used to compile the contents.
As a data scientist, I was tasked with forecasting if the first stage of the SpaceX Falcon 9 rocket will land successfully, so that a rival firm might submit better informed bids for a rocket launch against SpaceX. On its website, SpaceX promotes Falcon 9 rocket launches for 62 million dollars, whereas other companies charge upwards of 165 million dollars. A significant portion of the savings is attributable to SpaceX's ability to reuse the first stage. If we can determine whether the first stage will land, we can calculate the launch cost. This information might be useful if an alternative company want to compete with SpaceX for a rocket launch. In this project, I will conduct data science methodology including business understanding, data collection, data wrangling, exploratory data analysis, data visualization, model development, model evaluation, and stakeholder reporting.
In the fourth phase of the project, we will store our data in a sqlite3 database and execute SQL queries to retrieve the answers to the questions, thereby gaining a better understanding of the Spacex DataSet.
Before we begin, we will install SQLAlchemy, which offers a "Pythonic" language for interacting with databases. Instead of dealing with the distinctions between MySQL, PostgreSQL, and Oracle dialects of traditional SQL, you can employ the Pythonic structure of SQLAlchemy to streamline your workflow and query data more efficiently.
!pip install sqlalchemy==1.3.9
The next step is to connect to the database after the SQL extension has been loaded.
%load_ext sql
import csv, sqlite3
con = sqlite3.connect("my_data1.db")
cur = con.cursor()
%sql sqlite:///my_data1.db
Following a successful database connection, we import our dataframe from the URL provided by IBM and convert it into a SQL table named SPACEXTBL in order to conduct SQL queries.
import pandas as pd
df = pd.read_csv("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DS0321EN-SkillsNetwork/labs/module_2/data/Spacex.csv")
df.to_sql("SPACEXTBL", con, if_exists='replace', index=False, method="multi")
%sql SELECT * FROM SPACEXTBL
The following SQL query lists the unique launch locations for the space mission.
%sql SELECT DISTINCT LAUNCH_SITE as "Launch_Sites" FROM SPACEXTBL;
The subsequent SQL query returns 5 results when the launch site name starts with "CCA."
%sql SELECT LAUNCH_SITE FROM SPACEXTBL WHERE LAUNCH_SITE LIKE 'CCA%' LIMIT 5;
The following is an SQL query showing the total payload mass carried by NASA boosters (CRS)
%sql SELECT SUM(PAYLOAD_MASS__KG_) AS "TOTAL PAYLOAD MASS BY NASA(CRS)" FROM SPACEXTBL WHERE CUSTOMER = "NASA (CRS)";
With the help of the following SQL query, we can see the average payload size carried by booster F9 v1.1.
%sql SELECT AVG(PAYLOAD_MASS__KG_) FROM SPACEXTBL WHERE BOOSTER_VERSION = 'F9 v1.1';
We can find out when the first successful landing on a ground pad occurred using the following SQL query.
%sql SELECT MIN(DATE) AS "First Successful Landing" FROM SPACEXTBL WHERE "Landing _Outcome"='Success (ground pad)';
With the following SQL query, we can generate a list of boosters whose payload mass is larger than 4000 but less than 6000 and has been successfully deployed aboard a drone ship.
%sql SELECT BOOSTER_VERSION FROM SPACEXTBL WHERE "Landing _Outcome"='Success (drone ship)' AND PAYLOAD_MASS__KG_ > 4000 AND PAYLOAD_MASS__KG_ < 6000;
To see how many missions were completed successfully or unsuccessfully, we may use the following SQL query.
%%sql SELECT (SELECT COUNT("MISSION_OUTCOME") FROM SPACEXTBL WHERE "MISSION_OUTCOME" LIKE '%Success%') AS SUCCESS,
(SELECT COUNT("MISSION_OUTCOME") FROM SPACEXTBL WHERE "MISSION_OUTCOME" LIKE '%Failure%') AS FAILURE
To find out which booster versions have carried the heaviest payloads, run the following SQL query with subqyery.
%sql SELECT DISTINCT BOOSTER_VERSION, PAYLOAD_MASS__KG_ FROM SPACEXTBL WHERE PAYLOAD_MASS__KG_ =(SELECT MAX(PAYLOAD_MASS__KG_) FROM SPACEXTBL);
The following SQL query is used to list the month names, failure landing outcomes in drone ship, booster versions, and launch site for the months in 2015.
Note: SQLLite does not allow monthnames. To obtain the months, we must therefore use substr(Date, 4, 2) and substr(Date,7,4)='2015'.
%%sql SELECT substr("DATE", 4, 2) AS MONTH, "BOOSTER_VERSION", "LAUNCH_SITE" FROM SPACEXTBL
WHERE "LANDING _OUTCOME" = 'Failure (drone ship)' and substr("DATE",7,4) = '2015';
The following SQL query ranks landing outcomes in descending order from 04-06-2010 to 20-03-2017.
%%sql SELECT "Landing _Outcome" as "Landing Outcome", COUNT("Landing _Outcome") AS "Total Count" FROM SPACEXTBL WHERE DATE BETWEEN '04-06-2010' AND '20-03-2017'
GROUP BY "Landing _Outcome"
ORDER BY COUNT("Landing _Outcome") DESC ;