Introduction - Data Analysis with Python 3 and Pandas





Welcome to a data analysis tutorial with Python and the Pandas data analysis library.

The field of data analytics is quite large and what you might be aiming to do with it is likely to never match up exactly to any tutorial. With that in mind, I think the best way for us to approach learning data analysis with Python is simply by example. My plan here is to find some datasets and do some of the common data analysis tasks, using the Pandas package, to hopefully get you familiar enough with the package to work with it on your own.

To begin, let's make sure we're all on the same page.

I will be using Python 3.7 and Pandas 0.24.1.

You can likely follow along with different versions of things, just know there may be minor differences that you will need to work out. With Pandas, I have personally found I can usually google my errors with a high degree of success.

So, after you've got Python and done a pip install pandas, you're ready!

There will be quite a few packages and libraries that we install through the course of this series. If you'd rather focus on the code than on installing packages, you can check out a pre-compiled and optimized distribution of Python from ActiveState, which will have everything you will need to follow along with this series.

Let's jump in!

Oh, wait, we probably should have a dataset too.

The internet is stuffed full of datasets, so there are many to choose from. I am personally going to be using datasets from Kaggle.

If you are not familiar, Kaggle is a data analysis competitions website. I think that, if you're looking to practice real-world data analysis challenges, Kaggle is the single best place to do it, even if you're not looking to compete.

Many, if not most, of the competitions on Kaggle are actual company problems, just like the things I often get asked to do in my contract work, or that you might be asked to do if you find employment as a data analyst. These are typically "unsolved" problems, rather than the simpler, already-solved ones that you will typically encounter in tutorials.

I don't think we're quite ready to jump into anything serious, so let's find a simpler dataset to start with. To find datasets, check out the Kaggle Datasets page. Tons of goodies here.

To begin, let's check out Avocado Prices. I absolutely adore avocados! Did you know avocados are a fruit? Most closely classified as ... a berry! Imagine getting some "mixed berries" flavored thing, and there's avocado in there. Hah!

Anyway, download that dataset. You will need to log in or create an account to use Kaggle, but you should. If for whatever reason you don't want to, or the dataset is missing, I will also host it here: Avocado Prices.

Unzip the file using whatever you use to zip/unzip things, and you're left with a CSV file.

CSV files are one of the most common file types that you will find in data analysis. A CSV is meant to be organized by columns and rows: within each row the values are separated by commas (hey, is that where the name CSV comes from!?!), and the rows themselves are separated by new lines in the document. So, let's read this CSV in with Pandas.
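To make that structure concrete, here is a minimal sketch that parses a few lines of CSV text directly. The three rows are copied from the Albany data printed later on, trimmed down to three columns; `csv_text` and `sample_df` are just names I made up for the illustration, and `io.StringIO` stands in for a real file.

```python
import io

import pandas as pd

# A tiny CSV: a header line, then one line per row, values separated by commas.
csv_text = """Date,AveragePrice,region
2015-12-27,1.33,Albany
2015-12-20,1.35,Albany
2015-12-13,0.93,Albany
"""

# read_csv accepts a file path or a file-like object; StringIO lets us skip the file.
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)          # (number of rows, number of columns)
print(list(sample_df.columns))  # the header line becomes the column names
```

The real avocado.csv works the same way, just with more columns and a lot more rows.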

For now, let's make sure our file is in the same working directory as our Python script or in a directory like "datasets." I will be doing the latter, but you can feel free to do as you wish. So, to begin, we have a file called avocado.csv and we want to load that into pandas. It's a CSV file, so it's already in a sort of columns and rows format. We just want to load that into a pandas dataframe.

To do this, we will use a function called read_csv. Let's see how that works. I am going to be doing this in a Jupyter Notebook. You can use whatever editor you like, but Jupyter notebooks are pretty useful for data analysis and just general poking around with data. To use them, you can just do:

pip install jupyterlab

Then in a terminal/command prompt, you can do:

jupyter lab

Then you can go file > new > notebook, pick Python 3, and you're good to go! Let's start by loading in a file.

import pandas as pd  # convention to import and use pandas like this

df = pd.read_csv("datasets/avocado.csv")  # df stands for dataframe. Also a common convention to call this df

A dataframe is a type of pandas object that is basically a "table"-like object with columns and rows, on which we can also perform various calculations, statistical operations, etc. We can print it out:

df
Unnamed: 0 Date AveragePrice Total Volume 4046 4225 4770 Total Bags Small Bags Large Bags XLarge Bags type year region
0 0 2015-12-27 1.33 64236.62 1036.74 54454.85 48.16 8696.87 8603.62 93.25 0.00 conventional 2015 Albany
1 1 2015-12-20 1.35 54876.98 674.28 44638.81 58.33 9505.56 9408.07 97.49 0.00 conventional 2015 Albany
2 2 2015-12-13 0.93 118220.22 794.70 109149.67 130.50 8145.35 8042.21 103.14 0.00 conventional 2015 Albany
3 3 2015-12-06 1.08 78992.15 1132.00 71976.41 72.58 5811.16 5677.40 133.76 0.00 conventional 2015 Albany
4 4 2015-11-29 1.28 51039.60 941.48 43838.39 75.78 6183.95 5986.26 197.69 0.00 conventional 2015 Albany
5 5 2015-11-22 1.26 55979.78 1184.27 48067.99 43.61 6683.91 6556.47 127.44 0.00 conventional 2015 Albany
6 6 2015-11-15 0.99 83453.76 1368.92 73672.72 93.26 8318.86 8196.81 122.05 0.00 conventional 2015 Albany
7 7 2015-11-08 0.98 109428.33 703.75 101815.36 80.00 6829.22 6266.85 562.37 0.00 conventional 2015 Albany
8 8 2015-11-01 1.02 99811.42 1022.15 87315.57 85.34 11388.36 11104.53 283.83 0.00 conventional 2015 Albany
9 9 2015-10-25 1.07 74338.76 842.40 64757.44 113.00 8625.92 8061.47 564.45 0.00 conventional 2015 Albany
10 10 2015-10-18 1.12 84843.44 924.86 75595.85 117.07 8205.66 7877.86 327.80 0.00 conventional 2015 Albany
11 11 2015-10-11 1.28 64489.17 1582.03 52677.92 105.32 10123.90 9866.27 257.63 0.00 conventional 2015 Albany
12 12 2015-10-04 1.31 61007.10 2268.32 49880.67 101.36 8756.75 8379.98 376.77 0.00 conventional 2015 Albany
13 13 2015-09-27 0.99 106803.39 1204.88 99409.21 154.84 6034.46 5888.87 145.59 0.00 conventional 2015 Albany
14 14 2015-09-20 1.33 69759.01 1028.03 59313.12 150.50 9267.36 8489.10 778.26 0.00 conventional 2015 Albany
15 15 2015-09-13 1.28 76111.27 985.73 65696.86 142.00 9286.68 8665.19 621.49 0.00 conventional 2015 Albany
16 16 2015-09-06 1.11 99172.96 879.45 90062.62 240.79 7990.10 7762.87 227.23 0.00 conventional 2015 Albany
17 17 2015-08-30 1.07 105693.84 689.01 94362.67 335.43 10306.73 10218.93 87.80 0.00 conventional 2015 Albany
18 18 2015-08-23 1.34 79992.09 733.16 67933.79 444.78 10880.36 10745.79 134.57 0.00 conventional 2015 Albany
19 19 2015-08-16 1.33 80043.78 539.65 68666.01 394.90 10443.22 10297.68 145.54 0.00 conventional 2015 Albany
20 20 2015-08-09 1.12 111140.93 584.63 100961.46 368.95 9225.89 9116.34 109.55 0.00 conventional 2015 Albany
21 21 2015-08-02 1.45 75133.10 509.94 62035.06 741.08 11847.02 11768.52 78.50 0.00 conventional 2015 Albany
22 22 2015-07-26 1.11 106757.10 648.75 91949.05 966.61 13192.69 13061.53 131.16 0.00 conventional 2015 Albany
23 23 2015-07-19 1.26 96617.00 1042.10 82049.40 2238.02 11287.48 11103.49 183.99 0.00 conventional 2015 Albany
24 24 2015-07-12 1.05 124055.31 672.25 94693.52 4257.64 24431.90 24290.08 108.49 33.33 conventional 2015 Albany
25 25 2015-07-05 1.35 109252.12 869.45 72600.55 5883.16 29898.96 29663.19 235.77 0.00 conventional 2015 Albany
26 26 2015-06-28 1.37 89534.81 664.23 57545.79 4662.71 26662.08 26311.76 350.32 0.00 conventional 2015 Albany
27 27 2015-06-21 1.27 104849.39 804.01 76688.55 5481.18 21875.65 21662.00 213.65 0.00 conventional 2015 Albany
28 28 2015-06-14 1.32 89631.30 850.58 55400.94 4377.19 29002.59 28343.14 659.45 0.00 conventional 2015 Albany
29 29 2015-06-07 1.07 122743.06 656.71 99220.82 90.32 22775.21 22314.99 460.22 0.00 conventional 2015 Albany
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
18219 6 2018-02-11 1.56 1317000.47 98465.26 270798.27 1839.80 945638.02 768242.42 177144.00 251.60 organic 2018 TotalUS
18220 7 2018-02-04 1.53 1384683.41 117922.52 287724.61 1703.52 977084.84 774695.74 201878.69 510.41 organic 2018 TotalUS
18221 8 2018-01-28 1.61 1336979.09 118616.17 280080.34 1270.61 936859.49 796104.27 140652.84 102.38 organic 2018 TotalUS
18222 9 2018-01-21 1.63 1283987.65 108705.28 259172.13 1490.02 914409.26 710654.40 203526.59 228.27 organic 2018 TotalUS
18223 10 2018-01-14 1.59 1476651.08 145680.62 323669.83 1580.01 1005593.78 858772.69 146808.97 12.12 organic 2018 TotalUS
18224 11 2018-01-07 1.51 1517332.70 129541.43 296490.29 1289.07 1089861.24 915452.78 174381.57 26.89 organic 2018 TotalUS
18225 0 2018-03-25 1.60 271723.08 26996.28 77861.39 117.56 166747.85 87108.00 79495.39 144.46 organic 2018 West
18226 1 2018-03-18 1.73 210067.47 33437.98 47165.54 110.40 129353.55 73163.12 56020.24 170.19 organic 2018 West
18227 2 2018-03-11 1.63 264691.87 27566.25 60383.57 276.42 176465.63 107174.93 69290.70 0.00 organic 2018 West
18228 3 2018-03-04 1.46 347373.17 25990.60 71213.19 79.01 250090.37 85835.17 164087.33 167.87 organic 2018 West
18229 4 2018-02-25 1.49 301985.61 34200.18 49139.34 85.58 218560.51 99989.62 118314.77 256.12 organic 2018 West
18230 5 2018-02-18 1.64 224798.60 30149.00 38800.64 123.13 155725.83 120428.13 35257.73 39.97 organic 2018 West
18231 6 2018-02-11 1.47 275248.53 24732.55 61713.53 243.00 188559.45 88497.05 99810.80 251.60 organic 2018 West
18232 7 2018-02-04 1.41 283378.47 22474.66 55360.49 133.41 205409.91 70232.59 134666.91 510.41 organic 2018 West
18233 8 2018-01-28 1.80 185974.53 22918.40 33051.14 93.52 129911.47 77822.23 51986.86 102.38 organic 2018 West
18234 9 2018-01-21 1.83 189317.99 27049.44 33561.32 439.47 128267.76 76091.99 51947.50 228.27 organic 2018 West
18235 10 2018-01-14 1.82 207999.67 33869.12 47435.14 433.52 126261.89 89115.78 37133.99 12.12 organic 2018 West
18236 11 2018-01-07 1.48 297190.60 34734.97 62967.74 157.77 199330.12 103761.55 95544.39 24.18 organic 2018 West
18237 0 2018-03-25 1.62 15303.40 2325.30 2171.66 0.00 10806.44 10569.80 236.64 0.00 organic 2018 WestTexNewMexico
18238 1 2018-03-18 1.56 15896.38 2055.35 1499.55 0.00 12341.48 12114.81 226.67 0.00 organic 2018 WestTexNewMexico
18239 2 2018-03-11 1.56 22128.42 2162.67 3194.25 8.93 16762.57 16510.32 252.25 0.00 organic 2018 WestTexNewMexico
18240 3 2018-03-04 1.54 17393.30 1832.24 1905.57 0.00 13655.49 13401.93 253.56 0.00 organic 2018 WestTexNewMexico
18241 4 2018-02-25 1.57 18421.24 1974.26 2482.65 0.00 13964.33 13698.27 266.06 0.00 organic 2018 WestTexNewMexico
18242 5 2018-02-18 1.56 17597.12 1892.05 1928.36 0.00 13776.71 13553.53 223.18 0.00 organic 2018 WestTexNewMexico
18243 6 2018-02-11 1.57 15986.17 1924.28 1368.32 0.00 12693.57 12437.35 256.22 0.00 organic 2018 WestTexNewMexico
18244 7 2018-02-04 1.63 17074.83 2046.96 1529.20 0.00 13498.67 13066.82 431.85 0.00 organic 2018 WestTexNewMexico
18245 8 2018-01-28 1.71 13888.04 1191.70 3431.50 0.00 9264.84 8940.04 324.80 0.00 organic 2018 WestTexNewMexico
18246 9 2018-01-21 1.87 13766.76 1191.92 2452.79 727.94 9394.11 9351.80 42.31 0.00 organic 2018 WestTexNewMexico
18247 10 2018-01-14 1.93 16205.22 1527.63 2981.04 727.01 10969.54 10919.54 50.00 0.00 organic 2018 WestTexNewMexico
18248 11 2018-01-07 1.62 17489.58 2894.77 2356.13 224.53 12014.15 11988.14 26.01 0.00 organic 2018 WestTexNewMexico

18249 rows × 14 columns

Okay, that's a bit messy to print out every time. Often, we just want to see a small snippet of our dataframe to make sure everything is what we expect. Most people will use the .head() method for this:

df.head()
Unnamed: 0 Date AveragePrice Total Volume 4046 4225 4770 Total Bags Small Bags Large Bags XLarge Bags type year region
0 0 2015-12-27 1.33 64236.62 1036.74 54454.85 48.16 8696.87 8603.62 93.25 0.0 conventional 2015 Albany
1 1 2015-12-20 1.35 54876.98 674.28 44638.81 58.33 9505.56 9408.07 97.49 0.0 conventional 2015 Albany
2 2 2015-12-13 0.93 118220.22 794.70 109149.67 130.50 8145.35 8042.21 103.14 0.0 conventional 2015 Albany
3 3 2015-12-06 1.08 78992.15 1132.00 71976.41 72.58 5811.16 5677.40 133.76 0.0 conventional 2015 Albany
4 4 2015-11-29 1.28 51039.60 941.48 43838.39 75.78 6183.95 5986.26 197.69 0.0 conventional 2015 Albany

You can also pass a parameter to .head() to say how many rows you want, like:

df.head(3)
Unnamed: 0 Date AveragePrice Total Volume 4046 4225 4770 Total Bags Small Bags Large Bags XLarge Bags type year region
0 0 2015-12-27 1.33 64236.62 1036.74 54454.85 48.16 8696.87 8603.62 93.25 0.0 conventional 2015 Albany
1 1 2015-12-20 1.35 54876.98 674.28 44638.81 58.33 9505.56 9408.07 97.49 0.0 conventional 2015 Albany
2 2 2015-12-13 0.93 118220.22 794.70 109149.67 130.50 8145.35 8042.21 103.14 0.0 conventional 2015 Albany

Often, you may apply rolling-window types of operations, where the head will wind up containing NaN values, and you want to see the end instead. You can do that too, with .tail():

df.tail(6)
Unnamed: 0 Date AveragePrice Total Volume 4046 4225 4770 Total Bags Small Bags Large Bags XLarge Bags type year region
18243 6 2018-02-11 1.57 15986.17 1924.28 1368.32 0.00 12693.57 12437.35 256.22 0.0 organic 2018 WestTexNewMexico
18244 7 2018-02-04 1.63 17074.83 2046.96 1529.20 0.00 13498.67 13066.82 431.85 0.0 organic 2018 WestTexNewMexico
18245 8 2018-01-28 1.71 13888.04 1191.70 3431.50 0.00 9264.84 8940.04 324.80 0.0 organic 2018 WestTexNewMexico
18246 9 2018-01-21 1.87 13766.76 1191.92 2452.79 727.94 9394.11 9351.80 42.31 0.0 organic 2018 WestTexNewMexico
18247 10 2018-01-14 1.93 16205.22 1527.63 2981.04 727.01 10969.54 10919.54 50.00 0.0 organic 2018 WestTexNewMexico
18248 11 2018-01-07 1.62 17489.58 2894.77 2356.13 224.53 12014.15 11988.14 26.01 0.0 organic 2018 WestTexNewMexico
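Here is a quick sketch of why the head can fill up with NaN after a rolling-window operation. The prices are the first five Albany values from above; the window size of 3 and the names `prices` and `rolling_mean` are arbitrary choices for the illustration.

```python
import pandas as pd

# AveragePrice values copied from the first five rows printed above.
prices = pd.Series([1.33, 1.35, 0.93, 1.08, 1.28], name="AveragePrice")

# A 3-row rolling mean needs three values before it can produce a result,
# so the first two entries come back as NaN.
rolling_mean = prices.rolling(3).mean()

print(rolling_mean.head(3))  # begins with NaN values
print(rolling_mean.tail(3))  # every value here is populated
```

That is the kind of situation where .tail() tells you more than .head() does.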

We can also reference specific columns, like:

df['AveragePrice'].head()
0    1.33
1    1.35
2    0.93
3    1.08
4    1.28
Name: AveragePrice, dtype: float64

Also, you can use attribute-like dot notation like:

df.AveragePrice.head()
0    1.33
1    1.35
2    0.93
3    1.08
4    1.28
Name: AveragePrice, dtype: float64

But most people use the dict-like methodology. I am not sure if I have ever seen the attribute-like method used, so probably don't do it. Just know that other people might!

A common goal with data analysis is to visualize data. We all love pretty graphs, plus they usually help us generalize data pretty well. So, how might we graph this data? Looking at the data, it's clear that it's organized by date, but also by region, so we could plot line graphs of individual regions over time.
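Coming back to the dict-like versus attribute-like access for a moment, one practical reason to prefer the brackets is that some of this dataset's column names, like "Total Volume", contain a space. A small sketch, using a made-up `sample_df` with values copied from the rows above:

```python
import pandas as pd

# Two columns from the dataset, with values copied from the first three rows.
sample_df = pd.DataFrame({
    "AveragePrice": [1.33, 1.35, 0.93],
    "Total Volume": [64236.62, 54876.98, 118220.22],
})

# Both styles return the same Series when the name is a valid Python identifier.
same = sample_df["AveragePrice"].equals(sample_df.AveragePrice)
print(same)

# "Total Volume" contains a space, so writing sample_df.Total Volume
# is not valid Python syntax. The bracket form works for any column name.
volume = sample_df["Total Volume"]
print(volume.head())
```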

To do this, we'll need matplotlib, which is a popular data visualization library. To get it, let's do:

pip install matplotlib

Next, how might we get an individual region? We'd need to filter for that region column! Let's see how we might do that:

albany_df = df[df['region']=="Albany"]

Ok, so that might look a bit dense, but let's parse that out.

albany_df = df[ df['region'] == "Albany" ]

We're just saying that the albany_df is the df, where the df['region'] column is equal to Albany. The result is a new dataframe where this is the case:

albany_df.head()
Unnamed: 0 Date AveragePrice Total Volume 4046 4225 4770 Total Bags Small Bags Large Bags XLarge Bags type year region
0 0 2015-12-27 1.33 64236.62 1036.74 54454.85 48.16 8696.87 8603.62 93.25 0.0 conventional 2015 Albany
1 1 2015-12-20 1.35 54876.98 674.28 44638.81 58.33 9505.56 9408.07 97.49 0.0 conventional 2015 Albany
2 2 2015-12-13 0.93 118220.22 794.70 109149.67 130.50 8145.35 8042.21 103.14 0.0 conventional 2015 Albany
3 3 2015-12-06 1.08 78992.15 1132.00 71976.41 72.58 5811.16 5677.40 133.76 0.0 conventional 2015 Albany
4 4 2015-11-29 1.28 51039.60 941.48 43838.39 75.78 6183.95 5986.26 197.69 0.0 conventional 2015 Albany
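If that one-liner still feels dense, it can help to split it into its two steps. This is only a sketch on a four-row frame I put together from values in the dataset; `mask` and `albany_sample` are illustrative names, and the `.copy()` is an optional extra that the tutorial code does not use.

```python
import pandas as pd

# A few rows in the style of the dataset.
sample_df = pd.DataFrame({
    "region": ["Albany", "West", "Albany", "TotalUS"],
    "AveragePrice": [1.33, 1.60, 1.35, 1.56],
})

# Step 1: the comparison yields one True/False value per row.
mask = sample_df["region"] == "Albany"
print(mask.tolist())

# Step 2: indexing with that mask keeps only the rows where it is True.
# .copy() is optional; it makes the result independent of sample_df.
albany_sample = sample_df[mask].copy()

# The filtered frame keeps the original row labels rather than renumbering.
print(albany_sample.index.tolist())
```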

Okay, so one more thing you will often see is dataframes are "indexed" by something. Let's see what this dataframe is indexed by:

albany_df.index
Int64Index([    0,     1,     2,     3,     4,     5,     6,     7,     8,
                9,
            ...
            17603, 17604, 17605, 17606, 17607, 17608, 17609, 17610, 17611,
            17612],
           dtype='int64', length=338)

In this case, the index is worthless to us. It's just incrementing row counts, which we have no use for here. Instead, we should ask ourselves, how is this Albany avocado data organized? How does each row relate to the other? Well, by date. That's the main way this data is organized. So really, we want Date to be our index! We can do this with set_index.

albany_df.set_index("Date")
Unnamed: 0 AveragePrice Total Volume 4046 4225 4770 Total Bags Small Bags Large Bags XLarge Bags type year region
Date
2015-12-27 0 1.33 64236.62 1036.74 54454.85 48.16 8696.87 8603.62 93.25 0.00 conventional 2015 Albany
2015-12-20 1 1.35 54876.98 674.28 44638.81 58.33 9505.56 9408.07 97.49 0.00 conventional 2015 Albany
2015-12-13 2 0.93 118220.22 794.70 109149.67 130.50 8145.35 8042.21 103.14 0.00 conventional 2015 Albany
2015-12-06 3 1.08 78992.15 1132.00 71976.41 72.58 5811.16 5677.40 133.76 0.00 conventional 2015 Albany
2015-11-29 4 1.28 51039.60 941.48 43838.39 75.78 6183.95 5986.26 197.69 0.00 conventional 2015 Albany
2015-11-22 5 1.26 55979.78 1184.27 48067.99 43.61 6683.91 6556.47 127.44 0.00 conventional 2015 Albany
2015-11-15 6 0.99 83453.76 1368.92 73672.72 93.26 8318.86 8196.81 122.05 0.00 conventional 2015 Albany
2015-11-08 7 0.98 109428.33 703.75 101815.36 80.00 6829.22 6266.85 562.37 0.00 conventional 2015 Albany
2015-11-01 8 1.02 99811.42 1022.15 87315.57 85.34 11388.36 11104.53 283.83 0.00 conventional 2015 Albany
2015-10-25 9 1.07 74338.76 842.40 64757.44 113.00 8625.92 8061.47 564.45 0.00 conventional 2015 Albany
2015-10-18 10 1.12 84843.44 924.86 75595.85 117.07 8205.66 7877.86 327.80 0.00 conventional 2015 Albany
2015-10-11 11 1.28 64489.17 1582.03 52677.92 105.32 10123.90 9866.27 257.63 0.00 conventional 2015 Albany
2015-10-04 12 1.31 61007.10 2268.32 49880.67 101.36 8756.75 8379.98 376.77 0.00 conventional 2015 Albany
2015-09-27 13 0.99 106803.39 1204.88 99409.21 154.84 6034.46 5888.87 145.59 0.00 conventional 2015 Albany
2015-09-20 14 1.33 69759.01 1028.03 59313.12 150.50 9267.36 8489.10 778.26 0.00 conventional 2015 Albany
2015-09-13 15 1.28 76111.27 985.73 65696.86 142.00 9286.68 8665.19 621.49 0.00 conventional 2015 Albany
2015-09-06 16 1.11 99172.96 879.45 90062.62 240.79 7990.10 7762.87 227.23 0.00 conventional 2015 Albany
2015-08-30 17 1.07 105693.84 689.01 94362.67 335.43 10306.73 10218.93 87.80 0.00 conventional 2015 Albany
2015-08-23 18 1.34 79992.09 733.16 67933.79 444.78 10880.36 10745.79 134.57 0.00 conventional 2015 Albany
2015-08-16 19 1.33 80043.78 539.65 68666.01 394.90 10443.22 10297.68 145.54 0.00 conventional 2015 Albany
2015-08-09 20 1.12 111140.93 584.63 100961.46 368.95 9225.89 9116.34 109.55 0.00 conventional 2015 Albany
2015-08-02 21 1.45 75133.10 509.94 62035.06 741.08 11847.02 11768.52 78.50 0.00 conventional 2015 Albany
2015-07-26 22 1.11 106757.10 648.75 91949.05 966.61 13192.69 13061.53 131.16 0.00 conventional 2015 Albany
2015-07-19 23 1.26 96617.00 1042.10 82049.40 2238.02 11287.48 11103.49 183.99 0.00 conventional 2015 Albany
2015-07-12 24 1.05 124055.31 672.25 94693.52 4257.64 24431.90 24290.08 108.49 33.33 conventional 2015 Albany
2015-07-05 25 1.35 109252.12 869.45 72600.55 5883.16 29898.96 29663.19 235.77 0.00 conventional 2015 Albany
2015-06-28 26 1.37 89534.81 664.23 57545.79 4662.71 26662.08 26311.76 350.32 0.00 conventional 2015 Albany
2015-06-21 27 1.27 104849.39 804.01 76688.55 5481.18 21875.65 21662.00 213.65 0.00 conventional 2015 Albany
2015-06-14 28 1.32 89631.30 850.58 55400.94 4377.19 29002.59 28343.14 659.45 0.00 conventional 2015 Albany
2015-06-07 29 1.07 122743.06 656.71 99220.82 90.32 22775.21 22314.99 460.22 0.00 conventional 2015 Albany
... ... ... ... ... ... ... ... ... ... ... ... ... ...
2017-04-30 35 1.74 3046.63 388.81 280.28 0.00 2377.54 2377.54 0.00 0.00 organic 2017 Albany
2017-04-23 36 1.92 2087.60 110.25 182.56 0.00 1794.79 1794.79 0.00 0.00 organic 2017 Albany
2017-04-16 37 1.85 2886.48 265.82 203.84 0.00 2416.82 2416.82 0.00 0.00 organic 2017 Albany
2017-04-09 38 1.92 2209.82 159.65 189.67 0.00 1860.50 1860.50 0.00 0.00 organic 2017 Albany
2017-04-02 39 1.86 3492.87 885.46 362.37 0.00 2245.04 2245.04 0.00 0.00 organic 2017 Albany
2017-03-26 40 2.02 2250.22 166.49 263.32 0.00 1820.41 1820.41 0.00 0.00 organic 2017 Albany
2017-03-19 41 1.87 2763.38 503.14 175.98 0.00 2084.26 2084.26 0.00 0.00 organic 2017 Albany
2017-03-12 42 1.97 2001.95 123.51 206.64 0.00 1671.80 1671.80 0.00 0.00 organic 2017 Albany
2017-03-05 43 1.84 2228.14 241.00 208.79 0.00 1778.35 1778.35 0.00 0.00 organic 2017 Albany
2017-02-26 44 1.71 2185.96 508.31 240.10 0.00 1437.55 1437.55 0.00 0.00 organic 2017 Albany
2017-02-19 45 1.67 2523.56 1049.50 141.41 0.00 1332.65 1332.65 0.00 0.00 organic 2017 Albany
2017-02-12 46 1.78 1806.40 119.52 170.57 0.00 1516.31 1516.31 0.00 0.00 organic 2017 Albany
2017-02-05 47 1.72 1753.35 26.75 223.33 0.00 1503.27 1503.27 0.00 0.00 organic 2017 Albany
2017-01-29 48 1.86 1795.81 32.53 123.14 0.00 1640.14 1640.14 0.00 0.00 organic 2017 Albany
2017-01-22 49 1.82 1897.07 78.83 128.24 0.00 1690.00 1690.00 0.00 0.00 organic 2017 Albany
2017-01-15 50 1.84 1982.65 82.30 328.02 0.00 1572.33 1572.33 0.00 0.00 organic 2017 Albany
2017-01-08 51 1.94 2229.52 63.46 478.31 0.00 1687.75 1687.75 0.00 0.00 organic 2017 Albany
2017-01-01 52 1.87 1376.70 71.65 192.63 0.00 1112.42 1112.42 0.00 0.00 organic 2017 Albany
2018-03-25 0 1.71 2321.82 42.95 272.41 0.00 2006.46 1996.46 10.00 0.00 organic 2018 Albany
2018-03-18 1 1.66 3154.45 275.89 297.96 0.00 2580.60 2577.27 3.33 0.00 organic 2018 Albany
2018-03-11 2 1.68 2570.52 131.67 229.56 0.00 2209.29 2209.29 0.00 0.00 organic 2018 Albany
2018-03-04 3 1.48 3851.30 311.55 296.77 0.00 3242.98 3239.65 3.33 0.00 organic 2018 Albany
2018-02-25 4 1.56 5356.63 816.56 532.59 0.00 4007.48 4007.48 0.00 0.00 organic 2018 Albany
2018-02-18 5 1.43 7566.17 4314.30 251.85 0.00 3000.02 3000.02 0.00 0.00 organic 2018 Albany
2018-02-11 6 1.43 3817.93 59.18 289.85 0.00 3468.90 3468.90 0.00 0.00 organic 2018 Albany
2018-02-04 7 1.52 4124.96 118.38 420.36 0.00 3586.22 3586.22 0.00 0.00 organic 2018 Albany
2018-01-28 8 1.32 6987.56 433.66 374.96 0.00 6178.94 6178.94 0.00 0.00 organic 2018 Albany
2018-01-21 9 1.54 3346.54 14.67 253.01 0.00 3078.86 3078.86 0.00 0.00 organic 2018 Albany
2018-01-14 10 1.47 4140.95 7.30 301.87 0.00 3831.78 3831.78 0.00 0.00 organic 2018 Albany
2018-01-07 11 1.54 4816.90 43.51 412.17 0.00 4361.22 4357.89 3.33 0.00 organic 2018 Albany

338 rows × 13 columns

Wait, what? Why did it print out like that? Part of the benefit of the notebook is that it displays whatever the last line of a cell returns, which is how we got to see this, but I would explain it either way. Some of the methods in pandas will modify your dataframe in place, but MOST are going to simply do the thing and return a new dataframe. So if we just check real quick:

albany_df.head()
Unnamed: 0 Date AveragePrice Total Volume 4046 4225 4770 Total Bags Small Bags Large Bags XLarge Bags type year region
0 0 2015-12-27 1.33 64236.62 1036.74 54454.85 48.16 8696.87 8603.62 93.25 0.0 conventional 2015 Albany
1 1 2015-12-20 1.35 54876.98 674.28 44638.81 58.33 9505.56 9408.07 97.49 0.0 conventional 2015 Albany
2 2 2015-12-13 0.93 118220.22 794.70 109149.67 130.50 8145.35 8042.21 103.14 0.0 conventional 2015 Albany
3 3 2015-12-06 1.08 78992.15 1132.00 71976.41 72.58 5811.16 5677.40 133.76 0.0 conventional 2015 Albany
4 4 2015-11-29 1.28 51039.60 941.48 43838.39 75.78 6183.95 5986.26 197.69 0.0 conventional 2015 Albany

We can see that albany_df is not impacted. There are two ways we can handle this. One is to re-define:

albany_df = albany_df.set_index("Date")
albany_df.head()
Unnamed: 0 AveragePrice Total Volume 4046 4225 4770 Total Bags Small Bags Large Bags XLarge Bags type year region
Date
2015-12-27 0 1.33 64236.62 1036.74 54454.85 48.16 8696.87 8603.62 93.25 0.0 conventional 2015 Albany
2015-12-20 1 1.35 54876.98 674.28 44638.81 58.33 9505.56 9408.07 97.49 0.0 conventional 2015 Albany
2015-12-13 2 0.93 118220.22 794.70 109149.67 130.50 8145.35 8042.21 103.14 0.0 conventional 2015 Albany
2015-12-06 3 1.08 78992.15 1132.00 71976.41 72.58 5811.16 5677.40 133.76 0.0 conventional 2015 Albany
2015-11-29 4 1.28 51039.60 941.48 43838.39 75.78 6183.95 5986.26 197.69 0.0 conventional 2015 Albany
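The same behavior can be checked on a tiny frame. This is just a sketch with made-up names (`sample_df`, `indexed`) and three rows copied from above:

```python
import pandas as pd

sample_df = pd.DataFrame({
    "Date": ["2015-12-27", "2015-12-20", "2015-12-13"],
    "AveragePrice": [1.33, 1.35, 0.93],
})

# set_index returns a new dataframe; sample_df itself is left untouched.
indexed = sample_df.set_index("Date")

print(list(sample_df.columns))  # Date is still a regular column here
print(list(indexed.columns))    # Date has moved out of the columns...
print(indexed.index.name)       # ...and become the index
```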

The other option we can use is the inplace parameter. Something like:

albany_df.set_index("Date", inplace=True)

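would also work. Use one approach or the other, though, not both: once Date has become the index it is no longer a column, so a second set_index("Date") call raises a KeyError. Okay, now that we've done that, let's plot!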

albany_df['AveragePrice'].plot()
<matplotlib.axes._subplots.AxesSubplot at 0x11dd80940>

When we call .plot() on a dataframe, it is just assumed that the x axis will be your index and the y axis will be all of your numeric columns, which is why we picked out one column in particular.

This graph is a bit messy, however, especially with the dates, which also look out of order and such. Let's see if we can't carry on with this in the next tutorial!
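If you want to poke at the ordering issue before then, here is one possible approach, which may or may not match what the next tutorial does: convert the string dates to real datetimes and sort by the index. The frame below is a made-up three-row sample with the dates deliberately shuffled.

```python
import pandas as pd

# Three rows from the Albany data, deliberately out of chronological order.
sample_df = pd.DataFrame({
    "Date": ["2015-12-27", "2015-12-13", "2015-12-20"],
    "AveragePrice": [1.33, 0.93, 1.35],
}).set_index("Date")

# Convert the string labels to datetimes, then sort chronologically.
sample_df.index = pd.to_datetime(sample_df.index)
sample_df = sample_df.sort_index()

print(sample_df.index.is_monotonic_increasing)
print(sample_df)
```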
