Skip to content

README.md

README

The data is derived from Freebase Data Dumps: Google, Freebase Data Dumps, https://developers.google.com/freebase/data, Nov 10, 2015.

  • film-entities.txt.gz contains film related entities, one entity per line (not RDF). Contains 4.6 million entities.
  • names.gz contains type.object.name RDF entries for film related entities. Contains 4.1 million names (edges) in various languages.
  • entitylist.gz contains film related entities occuring in subject field, with valid type.object.name entries, plus film.performance mediators. Used to generate rdf-films.gz. [~3.5 million]
  • rdf-films.gz contains the RDF data for all the entities which have a name entry in names.gz. Currently contains ~18 million edges.
  • langnames.gz contains names of primary languages. [2084 edges]
  • countrynames.gz contains names of countries. [12381 edges]
$ zcat rdf-films.gz | wc -l
17910696

Film data from Freebase is like so:

# film.film --{film.film.starring}--> [mediator] --{film.performance.actor}--> film.actor
# Film --> Mediator
$ zgrep "<film.film.starring>" rdf-films.gz | wc -l
1397647

# DONE: The following is an artifact of how data was filtered from freebase. ~~Fix this.~~
# Fixed!
# Mediator --> Actor
$ zgrep "<film.performance.actor>" rdf-films.gz | wc -l
1396420

# Film --> Director
$ zgrep "<film.film.directed_by>" rdf-films.gz | wc -l
242212

# Director --> Film
$ zgrep "<film.director.film>" rdf-films.gz | wc -l
245274

# Film --> Initial Release Date
$ zgrep "<film.film.initial_release_date>" rdf-films.gz | wc -l
240858

# TODO: Generate Year of Release --> Film

# Film --> Genre
$ zgrep "<film.film.genre>" rdf-films.gz | wc -l
548152

# Genre --> Film
$ zgrep "<film.film_genre.films_in_this_genre>" rdf-films.gz | wc -l
546698

# TODO: film.film.runtime points to a mediator node, but then leads to nowhere. Remove this data.

# Film --> Primary Language
$ zgrep "<film.film.primary_language>" rdf-films.gz | awk '{print $3}' | uniq | sort | uniq | wc -l
55

# TODO: Generate Primary Language --> Film.

# Generated language names from names freebase rdf data.
$ zcat langnames.gz | awk '{print $1}' | uniq | sort | uniq | wc -l
55

# Film --> Country
$ zgrep "<film.film.country>" rdf-films.gz | wc -l
255239

# Total number of countries.
$ zgrep "<film.film.country>" rdf-films.gz | awk '{print $3}' | uniq | sort | uniq | wc -l
304

# Generated country names from names freebase rdf data.
$ zcat countrynames.gz | awk '{print $1}' | sort | uniq | wc -l
304

# Content Rating --> Film
$ zgrep "<film.content_rating.film>" rdf-films.gz | wc -l
24266

# Film -> Content Rating
$ zgrep "<film.film.rating>" rdf-films.gz | wc -l
24266

Filter out data

zgrep -v "<kg\." rdf-films.gz | grep -v "http://www.w3.org/1999/02/22-rdf-syntax-ns#type" | less

Generate 1M edges

$ zcat rdf-films.gz | grep "<film.film.directed_by>\|<film.director.film>\|<film.film.initial_release_date>\|<film.film.country>\|<film.film.primary_language>\|<film.film.rating>" | gzip > selective.gz
Something went wrong with that request. Please try again.