Manage and process 44GB data on Spark

Task description

Clean and extract information from VK data (VK is a social networking application) (see task_description)
Data size: 44GB
Sample data size: 16GB
Platform: pyspark 2.2.0 in ubuntu 16.04 LTS

How to run

In vk_project, uncomment the task functions you want to run:
runtasks

NOTE

Because the data is large, only a sample of each result (usually 20 rows) is shown here. To see the full results, run vk_project yourself.

Results for level basic

  1. count of comments, posts (all), original posts, reposts and likes made by user
    sample count of comments per user
    • COMMENTS COUNT
      countcomm
    • ALL POSTS COUNT
      allposts
    • ORIGINAL POSTS COUNT
      originalposts
    • REPOSTS COUNT
      reposts
    • LIKES COUNT
      likescount
  2. count of friends, groups, followers
  3. count of videos, audios, photos, gifts
    • COMBINED COUNTS
      videosgroupsetc
  4. count of "incoming" (made by other users) comments, max and mean "incoming" comments per post
    • INCOMING COMMENTS STATS:
      incmingcommstats
  5. count of "incoming" likes, max and mean "incoming" likes per post
    • INCOMING LIKES
      incominglikesstats
  6. count of geo tagged posts
    • Count of geo tagged posts
      geotaggedposts
  7. count of open / closed (e.g. private) groups a user participates in
    • Count of open and closed groups
      coungopenandclosedgroup
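
As an illustration of item 4 above (incoming comments: count, max and mean per post), here is a minimal plain-Python sketch of the aggregation logic. It is not the code from vk_project; the field names `post_owner_id`, `post_id` and `from_id` are hypothetical, and the mean is taken only over posts that received at least one incoming comment. On the full dataset the same grouping would be expressed as a Spark aggregation rather than in-memory dictionaries.

```python
from collections import Counter, defaultdict
from statistics import mean


def incoming_comment_stats(comments):
    """Per post owner: total, max and mean incoming comments per post.

    `comments` is an iterable of dicts with the hypothetical keys
    'post_owner_id', 'post_id' and 'from_id' (the comment author).
    """
    # Step 1: count comments per (owner, post), skipping the owner's own comments.
    per_post = Counter()
    for c in comments:
        if c["from_id"] != c["post_owner_id"]:
            per_post[(c["post_owner_id"], c["post_id"])] += 1

    # Step 2: collect the per-post counts for each owner.
    by_owner = defaultdict(list)
    for (owner, _post_id), n in per_post.items():
        by_owner[owner].append(n)

    # Step 3: aggregate the per-post counts for each owner.
    return {
        owner: {"count": sum(ns), "max": max(ns), "mean": mean(ns)}
        for owner, ns in by_owner.items()
    }


# Tiny made-up sample, for illustration only.
sample = [
    {"post_owner_id": 1, "post_id": 10, "from_id": 2},
    {"post_owner_id": 1, "post_id": 10, "from_id": 3},
    {"post_owner_id": 1, "post_id": 11, "from_id": 2},
    {"post_owner_id": 1, "post_id": 11, "from_id": 1},  # owner's own comment, ignored
    {"post_owner_id": 2, "post_id": 20, "from_id": 1},
]
stats = incoming_comment_stats(sample)
```

The two-step shape (count per post first, then aggregate per user) is the part that carries over: it corresponds to two successive group-by stages.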

Results for level medium

  1. count of reposts from subscribed and not-subscribed groups

    • COUNTS OF REPOSTS FROM SUB AND NONSUB GROUPS
      countsofsubandnonsub
  2. count of deleted users in friends and followers

    • COUNT OF DELETED USER
      countdeluser
  3. Aggregate (e.g. count, max, mean) characteristics for comments and likes (separately) made by (a) friends and (b) followers per post

    • LIKES PER POST FROM FOLLOWERS AND FRIENDS
      likeperpostFOLandFRI
    • COMMENTS PER POST FROM FOLLOWERS AND FRIENDS
      commsperpostFOLandFRI
  4. Aggregate (e.g. count, max, mean) characteristics for comments and likes (separately) made by (a) friends and (b) followers per user

    • LIKES PER USER FROM FOLLOWERS AND FRIENDS
      likeperuserFOLandFRI
    • COMMENTS PER USER FROM FOLLOWERS AND FRIENDS
      commsperpostFOLandFRI
  5. find emoji (separately, count of: all, negative, positive, others) in (a) user's posts (b) user's comments

    • EMOJI CLASSIFICATIONS COUNT
      emojicountcombined
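
For item 5 above, a rough plain-Python sketch of how emoji might be counted and classified in a single piece of text. Everything here is an assumption made for illustration: the Unicode ranges in `EMOJI_RE` cover only part of the emoji space, and the `POSITIVE` and `NEGATIVE` sets are tiny hand-picked examples, not the lists used in vk_project.

```python
import re

# Rough emoji matcher (assumption): pictographs/emoticons plus the
# "miscellaneous symbols" and "dingbats" blocks. Not exhaustive.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

# Hypothetical sentiment lookups, for illustration only.
POSITIVE = {"\U0001F600", "\U0001F60A", "\U0001F44D", "\u2764"}
NEGATIVE = {"\U0001F620", "\U0001F622", "\U0001F44E"}


def count_emoji(text):
    """Return counts of all, positive, negative and other emoji in `text`."""
    counts = {"all": 0, "positive": 0, "negative": 0, "others": 0}
    for ch in EMOJI_RE.findall(text):
        counts["all"] += 1
        if ch in POSITIVE:
            counts["positive"] += 1
        elif ch in NEGATIVE:
            counts["negative"] += 1
        else:
            counts["others"] += 1
    return counts


# Thumbs up, smiling face, crying face, rocket.
example = count_emoji("great \U0001F44D\U0001F60A but \U0001F622 \U0001F680")
```

A function like this works on one post or comment at a time, so per-user totals for (a) posts and (b) comments would come from applying it to each text and summing the four counters per user.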