NBC Universal turned to Spark to analyze all the content metadata for its international content distribution. Metadata associated with the media clips is stored in an Oracle database and in broadcast automation playlists. Spark is used to query the Oracle database and distribute the metadata from the broadcast automation playlists into multiple large in-memory resilient distributed datasets (RDDs). One RDD stores Scala objects containing media IDs, time codes, schedule dates and times, channels for airing, etc. It then creates multiple RDDs containing broadcast frequency counts by week, month, and year, using Spark’s map/reduceByKey to generate the counts. The resulting data is bulk loaded into HBase, where it is queried from a Java/Spring web application. The application converts the queried results into graphs illustrating media broadcast frequency counts by week, month, and year on an aggregate and a per-channel basis.
…
NBC Universal runs Apache Spark in production in conjunction with Mesos, HBase and HDFS and uses Scala as the programming language. The rollout in production happened in Q1 2014 and was smooth.
Apache Spark Improves the Economics of Video Distribution at NBC Universal – Databricks
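The counting pattern the excerpt describes — map each airing to a keyed pair, then reduceByKey to sum — can be sketched with plain Scala collections, which behave the same way on a single machine. Everything here is illustrative: the `Airing` case class, field names, and sample data are my own invention, not NBCU's actual schema; in real Spark you'd call `.map(...).reduceByKey(_ + _)` on an RDD instead of `groupBy`/`sum` on a `Seq`.

```scala
// Hypothetical record for one scheduled airing of a media clip.
// (NBCU's real objects also carry time codes, channels, etc.)
case class Airing(mediaId: String, week: Int)

object FrequencyCounts {
  def main(args: Array[String]): Unit = {
    val airings = Seq(
      Airing("clip-a", 1), Airing("clip-a", 1),
      Airing("clip-b", 1), Airing("clip-a", 2)
    )

    // Map step: emit ((mediaId, week), 1) for each airing,
    // mirroring rdd.map(a => ((a.mediaId, a.week), 1))
    val pairs = airings.map(a => ((a.mediaId, a.week), 1))

    // Reduce step: group by key and sum the ones,
    // mirroring pairs.reduceByKey(_ + _) in Spark
    val counts: Map[(String, Int), Int] =
      pairs.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }

    counts.toSeq.sortBy(_._1).foreach { case ((id, wk), n) =>
      println(s"$id week $wk aired $n time(s)")
    }
  }
}
```

The same two-step shape repeats for the month and year RDDs; only the key changes.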
Shit’s bonkers out there. If I’d proposed that to an architect “back in my day,” they’d have told me to go shoot myself. They’d say: “uh, so, how about we just make a database table and an ETL tool that does that?”
The last part - all those different technologies in the mix - is amazing. Again, the architect would say: “we write things in Java. Try again.”
Granted, the point is: things like Spark and friends let you move beyond dealing with just tidy data and analytics. But, still, sloppy is as sloppy does, right?