pandas is a fast, powerful, and easy-to-use tool for data analysis and manipulation.
import numpy as np
import pandas as pd
A Series is a one-dimensional labeled array that can hold any data type. To create a Series with pandas:
`s = pd.Series(data, index=index)`
d = {"b": 1, "a": 0, "c": 2}
pd.Series(d)  # the index is taken from the dict keys
pd.Series(d, index=["b", "c", "d", "a"])  # labels missing from d (here "d") become NaN
output:
b 1.0
c 2.0
d NaN
a 0.0
The DataFrame is the main data structure of pandas: a two-dimensional labeled table whose columns can each have a different type.
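dates = pd.date_range("20130101", periods=6)  # six consecutive days used as the row index
# randn draws from a standard normal distribution; the values are random, so your numbers will differ
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))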
df
output:
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-05 -0.424972  0.567020  0.276232 -1.087401
2013-01-06 -0.673690  0.113648 -1.478427  0.524988
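A DataFrame can also be created from a dictionary of objects, where each key becomes a column. The code behind the next output is not shown in these notes, so the following is only a sketch of one construction consistent with it; the specific dtypes chosen here (`float32`, `int32`, categorical) are assumptions.

```python
import numpy as np
import pandas as pd

# Each dict key becomes a column; scalar values are broadcast to every row.
df2 = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.Timestamp("20130102"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),  # assumed dtype
        "D": np.array([3] * 4, dtype="int32"),  # assumed dtype
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)
print(df2)
print(df2.dtypes)  # one dtype per column
```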
output:
     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo
To view the top and bottom rows of the frame:
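A minimal sketch of viewing both ends of a frame with `head()` and `tail()`. The frame is rebuilt here so the snippet runs on its own; its values are random.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))

top = df.head()      # first 5 rows by default
bottom = df.tail(3)  # last 3 rows
print(top)
print(bottom)
```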
df.to_numpy() gives a NumPy representation of the underlying data.
NumPy arrays have one dtype for the entire array, while pandas DataFrames have one dtype per column. pandas will find the NumPy dtype that can hold all of the dtypes in the DataFrame. This may end up being object, which requires casting every value to a Python object.
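To illustrate the dtype point, here is a small sketch with two hypothetical frames (`floats` and `mixed` are names made up for this example): one where every column is a float, and one that mixes floats with strings.

```python
import pandas as pd

# All-float frame: a single float64 dtype can hold every column.
floats = pd.DataFrame({"A": [1.0, 2.0], "B": [3.0, 4.0]})
arr_float = floats.to_numpy()
print(arr_float.dtype)  # float64

# Mixed frame: floats and strings share no numeric dtype,
# so every value is cast to a Python object.
mixed = pd.DataFrame({"A": [1.0, 2.0], "E": ["test", "train"]})
arr_mixed = mixed.to_numpy()
print(arr_mixed.dtype)  # object
```

The object array is usually slower to work with, so it can be worth selecting only the numeric columns before converting.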
.loc is strict when you present slicers that are not compatible (or convertible) with the index type.
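As a sketch of what "strict" means in practice, assume a frame indexed by dates (`dfl` is a name chosen for this example). Integer slicers cannot be converted to timestamps, so `.loc` raises, while date-like strings can be converted and work.

```python
import numpy as np
import pandas as pd

dfl = pd.DataFrame(
    np.random.randn(5, 4),
    columns=list("ABCD"),
    index=pd.date_range("20130101", periods=5),
)

# Integer slicers are not compatible with a DatetimeIndex, so .loc raises.
try:
    dfl.loc[2:3]
    raised = False
except TypeError as exc:
    raised = True
    print("TypeError:", exc)

# String slicers can be converted to timestamps, so this works.
# Label-based slices include both endpoints.
subset = dfl.loc["20130102":"20130104"]
print(subset)
```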
Setting a new column automatically aligns the data by index: values are matched on index labels, not on position.
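A short sketch of that alignment, assuming a Series `s1` whose index starts one day later than the frame's index. The frame is rebuilt here so the snippet is self-contained.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))

# s1 covers 2013-01-02 through 2013-01-07, one day later than df.
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range("20130102", periods=6))

# Values are matched by date: 2013-01-01 has no match and becomes NaN,
# and the 2013-01-07 value is dropped because df has no such row.
df["F"] = s1
print(df["F"])
```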
pandas primarily uses the value np.nan to represent missing data.
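# reindex returns a copy with the first four rows and a new, empty column E
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ["E"])
df1.loc[dates[0] : dates[1], "E"] = 1  # fill E for the first two rows; the rest stay NaN
df1.dropna(how="any")  # returns a copy without rows that contain any missing value
df1.fillna(value=5)  # returns a copy with missing values replaced by 5
pd.isna(df1)  # boolean mask that is True where a value is missing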
Operations in general exclude missing data.
To compute a descriptive statistic, such as the mean of each column:
df.mean()
A -0.004474
B -0.383981
C -0.687758
D 5.000000
F 3.000000
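To make the handling of missing data concrete, here is a small sketch on a hypothetical frame `df_nan` with one missing value. By default `mean()` skips NaN; passing `skipna=False` makes it propagate instead.

```python
import numpy as np
import pandas as pd

df_nan = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [4.0, 5.0, 6.0]})

col_means = df_nan.mean()            # NaN is skipped: A -> 2.0, B -> 5.0
strict = df_nan.mean(skipna=False)   # NaN propagates: A -> NaN, B -> 5.0
row_means = df_nan.mean(axis=1)      # per-row means: 2.5, 5.0, 4.5

print(col_means)
print(strict)
print(row_means)
```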
more pandas operations: link