樞紐分析表 – Pivot

大部份的主管，都希望看到如下的表格，讓他們可以仔細分析每種產品每日的銷售狀況。

日期            香蕉     蘋果    橘子    芭樂     蓮霧    芒果
2021-01-01	15	0	20	30	20	0
2021-01-02	0	10	45	0	0	30
2021-01-03	20	0	0	0	35	0
2021-01-04	30	60	0	0	0	0
2021-01-05	20	0	0	0	0	50
2021-01-06	0	0	65	0	0	0

這也讓製表的人產生了錯覺，以為資料就必需設計成這種格式。尤其是資料庫設計人員，常會設計成如下錯誤的格式。

id     INT    PK   NN   AI
日期   date                     
香蕉   int
蘋果   int
橘子   int
芭樂   int
蓮霧   int
芒果   int

別懷疑，本人看過90%以上的同事朋友，都是這麼搞的。

流水帳

資料庫的設計，其實必需使用如下流水帳

1	2021-01-01	香蕉	15
2	2021-01-01	橘子	20
3	2021-01-01	芭樂	30
4	2021-01-02	蘋果	10
5	2021-01-02	芒果	30
6	2021-01-02	橘子	45
7	2021-01-03	香蕉	20
8	2021-01-01	蓮霧	20
9	2021-01-03	蓮霧	35
10	2021-01-04	香蕉	30
11	2021-01-04	連霧	20
12	2021-01-04	蘋果	60
13	2021-01-05	香蕉	20
14	2021-01-05	芒果	50
15	2021-01-06	橘子	65

資料表的格式如下

id    int
日期  date
水果  varchar(20)
數量  int

讀取chartify資料

chartify裏有一份流水帳，可供後續樞紐分析。如果使用 df.values 以list的方式列印，注意第一欄位為日期格式，需使用
‘{:%Y-%m-%d}’.format(ls[0]) 來顯示所要的格式。另外, values必需使用pd.isnull()來判定是否為空值。不過我們在此直接使用 print()列印整個DataFrame。

import pandas as pd
import chartify
display=pd.options.display
display.max_columns=None
display.max_rows=None
display.width=None
display.max_colwidth=None
df = chartify.examples.example_data()
print(df)

結果如下
          date country   fruit  unit_price  quantity  total_price
0   2017-10-21      US  Banana    0.303711         4     1.214846
1   2017-05-30      JP  Banana    0.254109         4     1.016436
2   2017-05-21      CA  Banana    0.268635         4     1.074539
3   2017-09-18      BR   Grape    2.215277         2     4.430554
4   2017-12-08      US  Banana    0.308337         5     1.541687
5   2017-06-05      GB   Apple    0.870118         2     1.740235
6   2017-09-05      JP  Banana    0.279179         7     1.954252
7   2017-08-27      CA   Apple    1.025265         4     4.101059
8   2017-09-14      CA   Apple    1.078831         4     4.315324
9   2017-05-26      GB   Grape    1.840909         2     3.681818
10  2017-08-01      CA   Grape    2.106204         1     2.106204

搜尋

先產生一個mask，產生所有列的True or False資料，然後置為 df之中即可

mask=(df['日期']>='2017-12-20') & (df['日期']<='2017-12-31')
df=df[mask]

pandas pivot

先來認識一下pandas的pivot. 有人稱這是透視表, 台灣則稱為樞紐分析表. 相關的觀念請看資料表格式這一篇說明.

chartify裏的練習資料, 其實就是流水帳資料. 如果要轉化成人類能看懂的樞鈕分析表, 就要使用 pandas的pivot功能. pivot有四個重要的變數, 分別是index, values, columns, aggfunc

首先看一下index, 設定為 country, fruit, date, 則資料表就會依國家, 水果, 日期分類, 如下

table=df.pivot_table(index=['country', 'fruit', 'date'])
table.to_excel('test1.xlsx')
print(table)

                           quantity  total_price  unit_price
country fruit  date                                         
BR      Apple  2017-01-10       2.0     1.808778    0.904389
               2017-02-13       2.0     1.664359    0.832179
               2017-04-16       1.0     0.998892    0.998892
               2017-06-22       6.0     6.634887    1.105814
               2017-06-23       5.0     5.142000    1.028400
...                             ...          ...         ...
US      Orange 2017-12-21       3.0     1.226949    0.408983
               2017-12-22       3.5     1.836375    0.544316
               2017-12-23       3.0     1.506526    0.502175
               2017-12-27       4.0     1.643388    0.410847
               2017-12-30       2.0     1.058775    0.529387

[888 rows x 3 columns]

接下來是columns, 表示要顯示的欄位(Columns), 相對的, 就必需指定要顯示的值(Values). 以下列的例子, 說明了依國家為分類, 列出每天(列) 各種水果(欄)的單價

table=df.pivot_table(index=['country','date'], columns='fruit', values='unit_price')
table.to_excel('test2.xlsx')

fruit                  Apple    Banana     Grape    Orange
country date                                              
BR      2017-01-10  0.904389       NaN       NaN       NaN
        2017-02-13  0.832179       NaN       NaN       NaN
        2017-02-27       NaN       NaN       NaN  0.560445
        2017-03-29       NaN  0.235860       NaN       NaN
        2017-04-05       NaN       NaN  1.960156       NaN
...                      ...       ...       ...       ...
US      2017-12-27       NaN  0.275843       NaN  0.410847
        2017-12-28  0.954567       NaN       NaN       NaN
        2017-12-29       NaN  0.253700       NaN       NaN
        2017-12-30       NaN  0.277271       NaN  0.529387
        2017-12-31       NaN  0.236518       NaN       NaN

Pivot參數

pivot_table 預設計算所有資料的平均值，如果沒有資料預設填入 NaN。

如果要將 NaN顯示為 0，則需加入 fill_value=0。若不要顯示任何東西，則加入 fill_value=””。

若要計算總合的話，必需加入如下 aggfunc=’sum’。

table=df.pivot_table(index=['日期'], columns=['水果'], values=['數量'],fill_value="",aggfunc='sum')

MySQL樞紐分析

MySQL的SQL語法，也可以作樞紐分析，SQL語法如下。
請注意，如果同一日期若有二個國家以上的話，需使用 group_concat()將所有國家名串起來。

use cloud;
select 日期, group_concat(國家) as 國家,
sum(case when 水果 = 'Apple' then 數量 else 0 end) as 蘋果,
sum(case when 水果 = 'Banana' then 數量 else 0 end) as 香蕉,
sum(case when 水果 = 'Orange' then 數量 else 0 end) as 柳橙,
sum(case when 水果 = 'Grape' then 數量 else 0 end) as 葡萄, 
format(sum(單價*數量), 4) as 總價 
from fruit group by (日期)
order by 日期

上述的SQL語法，就算有上千萬資料筆數，都能秒殺產生如下的樞紐分析表。

日期            香蕉     蘋果    橘子    芭樂     蓮霧    芒果
2021-01-01	15	0	20	30	20	0
2021-01-02	0	10	45	0	0	30
2021-01-03	20	0	0	0	35	0
2021-01-04	30	60	0	0	0	0
2021-01-05	20	0	0	0	0	50
2021-01-06	0	0	65	0	0	0

Python代碼如下

import mysql.connector as mysql
import pandas as pd
display=pd.options.display
display.max_columns=None
display.max_rows=None
display.width=None
display.max_colwidth=None
conn=mysql.connect(
    host='mahaljsp.ddns.net',
    user='帳號',
    password='密碼',
    database='資料庫'
)
cursor=conn.cursor()
cmd="select * from fruit"
cursor.execute(cmd)
rs=cursor.fetchall()
cmd='''
select 日期,
sum(case when 水果 = 'Apple' then 數量 else 0 end) as 蘋果,
sum(case when 水果 = 'Banana' then 數量 else 0 end) as 香蕉,
sum(case when 水果 = 'Orange' then 數量 else 0 end) as 柳橙,
sum(case when 水果 = 'Grape' then 數量 else 0 end) as 葡萄,
format(sum(單價*數量), 4) as 總價
from fruit group by 日期
order by 日期
'''
cursor.execute(cmd)
data=cursor.fetchall()
columns=[d[0] for d in cursor.description]
df=pd.DataFrame(data=data, columns=columns)
conn.close()
print(df)

結果 : 
           日期  蘋果 香蕉  柳橙 葡萄   總價
0    2017-01-10   2    0    0    0   1.8088
1    2017-01-12   0    0    2    0   0.8296
2    2017-01-22   0    0    0    1   1.9985
3    2017-01-27   0    6    0    0   1.3908
4    2017-01-28   3    0    0    0   2.6586
5    2017-01-31   2    0    0    0   1.6686

Cashier

本站有另一個 cashier 資料庫，裏面的銷售明細裏有 1000 萬筆資料，使用 workbench 執行 SQL 語法產生樞扭分析表時會超時斷線發生 Error Code: 2013. Lost connection to MySQL server during query 錯誤。請進入 workbench 的 Edit/Preferences/SQL Editor，將 DBMS connection read timeout interval(in seconds) 由 30 秒改成 300秒，再由如下 SQL 語法執行。

在 Python 中使用 SQL 語法花費約 162 秒。

import time
from MahalSdk.sql import sql
conn=sql.conn()
cursor=conn.cursor()
cmd="""
select 日期,
sum(case when 商品編號 = '0001' then 數量 else 0 end) as 科學麵,
sum(case when 商品編號 = '0002' then 數量 else 0 end) as 台灣啤酒,
sum(case when 商品編號 = '0003' then 數量 else 0 end) as 可樂果,
sum(case when 商品編號 = '0004' then 數量 else 0 end) as 魷魚絲, 
sum(case when 商品編號 = '0005' then 數量 else 0 end) as 話匣子, 
sum(case when 商品編號 = '0006' then 數量 else 0 end) as 布丁, 
sum(case when 商品編號 = '0007' then 數量 else 0 end) as 牛肉麵, 
sum(case when 商品編號 = '0008' then 數量 else 0 end) as 來一客, 
sum(case when 商品編號 = '0009' then 數量 else 0 end) as 烏骨雞, 
sum(case when 商品編號 = '0010' then 數量 else 0 end) as 黑松汽水, 
format(sum(售價*數量), 0) as 總價  
from 銷售明細view 
group by 日期 
order by 日期
"""
#from (select * from 銷售明細view where 日期 >='2024-01-01' and 日期 <'2024-03-01') temp group by 日期 order by 日期
t1=time.time()
cursor.execute(cmd)
t2=time.time()
rs=cursor.fetchall()
for r in rs:
    print(r)
print(f"花費時間 : {t2-t1}秒")

如果使用 Pandas pivoit 花費約 169 秒左右。

import pandas as pd
import time
from MahalSdk.sql import sql

conn=sql.conn()
cursor=conn.cursor()
t1=time.time()
cursor.execute("select * from 銷售明細view")
rs=cursor.fetchall()
columns=[d[0] for d in cursor.description]
#print(rs)
df=pd.DataFrame(data=rs, columns=columns)
table=df.pivot_table(index=['日期'], columns=['品名'], values=['數量'],fill_value="",aggfunc='sum')
t2=time.time()
print(f"花費時間 : {t2-t1}秒")
table.to_excel("total.xlsx", sheet_name="Sheet1")
#print(table)

流水帳與樞紐分析