Contents

数据可视化 电影数据分析

数据可视化 电影数据分析

说明

这次的文章参考自 https://zhuanlan.zhihu.com/p/35453189

提出问题

简单分析制作一部电影时,应考虑哪些客观因素才能使电影成功? 为客户提供有效的数据一句,以便作出更准确 的决策 本次数据分析报告主要围绕一下几点进行分析

  1. 问题一:电影类型如何随时间发生变化?

    • 电影数量上的对比
    • 电影收入的对比
  2. 问题二:影响电影收入的客观因素有哪些? 这个不会,先放一放

  3. 问题三:两家电影公司Universal Pictures 和 Paramount Pictures之间的对比。

    • 电影发行数量上的对比
    • 电影产生的利润对比
  4. 问题四:改编电影和原创电影之间的对比。

    • 电影发行数量上的对比
    • 电影产生的利润对比

理解数据

重点关注的变量有:

  • imdb_id IMDB 标识号
  • popularity 在 Movie Database 上的相对页面查看次数
  • budget 预算(美元)
  • revenue 收入(美元)
  • original_title 电影名称
  • cast 演员列表,按 | 分隔,最多 5 名演员
  • homepage 电影首页的 URL
  • director 导演列表,按 | 分隔,最多 5 名导演
  • tagline 电影的标语
  • keywords 与电影相关的关键字,按 | 分隔,最多 5 个关键字
  • overview 剧情摘要
  • runtime 电影时长
  • genres 风格列表,按 | 分隔,最多 5 种风格
  • production_companies 制作公司列表,按 | 分隔,最多 5 家公司
  • release_date 首次上映日期
  • vote_count 评分次数
  • vote_average 平均评分
  • release_year 发行年份
  • budget_adj 根据通货膨胀调整的预算(2010 年,美元)
  • revenue_adj 根据通货膨胀调整的收入(2010 年,美元)

导入数据

using MLJFlux, Flux, MLJ, DataFrames, CSV, StatsBase, Dates
import JSON
creditsdata = CSV.read("data/movies/tmdb_5000_credits.csv", DataFrame)
moviesdata = CSV.read("data/movies/tmdb_5000_movies.csv", DataFrame)

select!(creditsdata, Not(:title))
fulldata = hcat(creditsdata, moviesdata)

使用 excel 查看数据

查看缺失值情况

for column in names(fulldata)
  missingcount = count(ismissing, fulldata[!, column])
  println("$column: \t $missingcount")
end

发现只有以下字段有缺失值

  • release_date
  • runtime

数据清洗

选择子集

columns = [:id, :title, :vote_average, :production_companies, :genres,
	   :release_date, :keywords, :runtime, :budget, :revenue, :vote_count, :popularity]
fulldata = select(fulldata, columns)

缺失值处理

fillReleaseDate(dataframe::DataFrame) = begin
  mapdate(date::Union{Missing, Date}) = begin
    if ismissing(date)
      return Date("2014-06-01")
    else
      return date
    end
  end

  dataframe[!, :release_date] = map(mapdate, dataframe[!, :release_date])
  return dataframe
end

fillRuntime(dataframe::DataFrame) = begin
  meanvalue = mean(skipmissing(dataframe[!, :runtime]))
  mapruntime(runtime::Union{Missing, Float64}) = begin
    if ismissing(runtime)
      return meanvalue
    else
      return runtime
    end
  end

  dataframe[!, :runtime] = map(mapruntime, dataframe[!, :runtime])
  return dataframe
end

数据类型转换

  • 在时间序列中提取年份

    generateReleaseYear(dataframe::DataFrame) = begin
      dataframe[!, :release_year] = map(year, dataframe[!, :release_date])
      return dataframe
    end
    
  • 将电影类型添加到列,需进行one-hot编码

    function generateGenreType(dataframe::DataFrame)
      len = first(size(dataframe))
    
      for column in genrelist
        dataframe[!, column] = zeros(len)
        for row in eachrow(dataframe)
          if contains(row.genres, column)
    	row[column] = 1
          end
        end
      end
    
      return dataframe
    end
    
  • 用年份索引

    function sortByReleaseYear(dataframe::DataFrame)
      sort(dataframe, [:release_year])
    end
    

数据转换

featureSelector = FeatureSelector(
  features = [:release_date],
  ignore = true
)

transformModel = Pipeline(
  fillReleaseDate,
  fillRuntime,
  generateReleaseYear,
  featureSelector,
  generateGenreType,
  sortByReleaseYear
  # generateName
)

transformMachine = machine(transformModel, fulldata)

fit!(transformMachine)
transformedData = MLJ.transform(transformMachine, fulldata)

问题分析

电影类型随时间变化

  • 提取电影类型

    function fetchGenreList(dataframe::DataFrame)
      mapfn(array::Vector{Any}) = map(x -> x["name"], array)
      genrelist = Set{String}()
      jsons = map(JSON.parse, dataframe[!, :genres])
      for json in jsons
        names = mapfn(json)
        for name in names
          push!(genrelist, name)
        end
      end
      return genrelist
    end
    
    const genrelist = fetchGenreList(fulldata)
    
  • 对每个类型的电影按年份求和

    function groupByReleaseYear(dataframe::DataFrame)
      dataframes = groupby(dataframe, :release_year)
      years = Int[]
      counts = Int[]
      for _dataframe in dataframes
        year = first(_dataframe.release_year)
        count = first(size(_dataframe))
    
        push!(years, year)
        push!(counts, count)
      end
    
      bar(years, counts, xticks = :all, size = figuresize) |> display
    end
    
    groupByReleaseYear(transformedData)
    

  • 汇总各电影类型的总量

    function groupByEachGenre(dataframe::DataFrame)
    
      record = Dict{String, Int}()
      for genre in genrelist
        record[genre] = 0
      end
    
      for row in eachrow(dataframe)
        for genre in genrelist
          record[genre] += row[genre]
        end
      end
    
      xs = collect(keys(record))
      ys = collect(values(record))
    
      bar(xs, ys, xticks = :all, size = figuresize) |> display
    end
    
    groupByEachGenre(transformedData)
    

  • 电影类型随时间的变化

    function plotGenreAndTime(dataframe::DataFrame)
      columns = ["Drama","Comedy","Thriller","Action","Romance","Adventure",
    	     "Crime","Science Fiction","Horror","Family", "release_year"]
      _dataframes = groupby(select(dataframe, columns), :release_year)
      # p = plot()
    
      # record: Dict{year, Dict{Name, Count}}
      record = Dict{Int, Dict{String, Int}}()
      for _dataframe in _dataframes
        # years
        # counts
        year = first(_dataframe.release_year)
        record[year] = Dict{String, Int}()
        for column in columns[columns .!= "release_year"]
          record[year][column] = reduce(+, _dataframe[!, column])
        end
      end
    
      _years = collect(keys(record))
      _countmaps = collect(values(record))
      indexs = sortperm(_years)
    
      years = _years[indexs]
      countmaps = _countmaps[indexs]
    
      p = plot(size = figuresize, xticks = :all)
      for column in columns[columns .!= "release_year"]
        counts = map(x -> x[column], countmaps)
        plot!(p, years, counts, label = column, xticks = :all)
      end
    
      plot(p) |> display
    end
    
    plotGenreAndTime(transformedData)
    

Universal Pictures和Paramount Pictures之间的对比

  • 电影发行量对比

    function plotCompareTotal(dataframe::DataFrame)
      dataframe[!, "Universal Pictures"] = map(s -> contains(s, "Universal Pictures") ? 1 : 0, dataframe[!, :production_companies])
      dataframe[!, "Paramount Pictures"] = map(s -> contains(s, "Paramount Pictures") ? 1 : 0, dataframe[!, :production_companies])
    
      universalTotal = reduce(+, dataframe[!, "Universal Pictures"])
      paramountTotal = reduce(+, dataframe[!, "Paramount Pictures"])
      total = universalTotal + paramountTotal
    
      xs = ["Universal Pictures", "Paramount Pictures"]
      ys = [universalTotal / total, paramountTotal / total]
      pie(xs, ys, aspect_ratio = 1.0) |> display
    
      companyDifference = groupby(select(dataframe, vcat(xs, "release_year")), :release_year)
      # record: Dict{Year, Dict{Company, Int}}
      record = Dict{Int, Dict{String, Int}}()
      for _dataframe in companyDifference
        year = first(_dataframe.release_year)
        record[year] = Dict{String, Int}()
        for column in xs
          count = reduce(+, _dataframe[!, column])
          record[year][column] = count
        end
      end
    
      _years = collect(keys(record))
      _countmaps = collect(values(record))
      indexs = sortperm(_years)
    
      years = _years[indexs]
      countmaps = _countmaps[indexs]
    
      p = plot(size = figuresize)
      for column in xs
        counts = map(x -> x[column], countmaps)
        plot!(p, years, counts, label = column, xticks = :all)
      end
    
      plot(p) |> display
    end
    
    plotCompareTotal(transformedData)
    

  • 利润对比

    function plotCompareProfit(dataframe::DataFrame)
      dataframe[!, :profit] = dataframe[!, :revenue] .- dataframe[!, :budget]
      dataframe[!, "Universal Profit"] = dataframe[!, "Universal Pictures"] .* dataframe[!, :profit]
      dataframe[!, "Paramount Profit"] = dataframe[!, "Paramount Pictures"] .* dataframe[!, :profit]
    
      universalProfit = reduce(+, dataframe[!, "Universal Profit"])
      paramountProfit = reduce(+, dataframe[!, "Paramount Profit"])
      totalProfit = universalProfit + paramountProfit
    
      xs = ["Universal Profit", "Paramount Profit"]
      ys = [universalProfit / totalProfit, paramountProfit / totalProfit]
      pie(xs, ys) |> display
    
      companyDifference = groupby(select(dataframe, vcat(xs, "release_year")), :release_year)
      # record: Dict{Year, Dict{Company, Number}}
      record = Dict{Int, Dict{String, Number}}()
      for _dataframe in companyDifference
        year = first(_dataframe.release_year)
        record[year] = Dict{String, Number}()
        for column in xs
          profit = reduce(+, _dataframe[!, column])
          record[year][column] = profit
        end
      end
    
      _years = collect(keys(record))
      _profitmaps = collect(values(record))
      indexs = sortperm(_years)
    
      years = _years[indexs]
      profitmaps = _profitmaps[indexs]
    
      p = plot(size = figuresize)
      for column in xs
        profits = map(x -> x[column], profitmaps)
        plot!(p, years, profits, label = column, xticks = :all)
      end
    
      plot(p) |> display
    end
    
    plotCompareProfit(transformedData)
    

改编电影和原创电影的对比

  • 数量对比

    function plotCompareOriginal(dataframe::DataFrame)
      column = "is original"
      dataframe[!, column] = map(x -> contains(x, "based on novel") ? 0 : 1, dataframe[!, :keywords])
    
      keycount = countmap(dataframe[!, column])
      total = keycount[0] + keycount[1]
      xs = ["is original", "not original"]
      ys = [keycount[1] / total, keycount[0] / total]
    
      pie(xs, ys) |> display
    end
    
    plotCompareOriginal(transformedData)
    

  • 平均利润对比

    function plotCompareProfit(dataframe::DataFrame)
      column = "is original"
      (notoriginalDataframe, originalDataframe) = groupby(select(dataframe, [column, "profit"]), column)
      # record: Dict{is original, profit}
      originalProfit = reduce(+, originalDataframe[!, :profit])
      notoriginalProfit = reduce(+, notoriginalDataframe[!, :profit])
      originalCount = first(size(originalDataframe))
      notoriginalCount = first(size(notoriginalDataframe))
      bar(xs, [originalProfit / originalCount, notoriginalProfit / notoriginalCount], size = figuresize) |> display
    end
    
    plotCompareProfit(transformedData)