Scraping with OutWit Hub

I wrote the post below for my introductory data journalism class at the University of Tennessee. It provides an example of how to use OutWit Hub to scrape information from numerous pages structured in the same manner.

————

The other day during class I scraped data from the Congressional Medal of Honor Society's website using OutWit Hub. Ahead of today's assignment, I figured I would pull together a guide on how I did it.

Step 1. We need to decide what we want and where it is.

I want some descriptive data about living Medal of Honor recipients in order to provide some context to the reporting we are doing at the Medal of Honor Project. Specifically, I want each recipient's name, rank, date of birth, place of birth, date of medal-winning action, place of action and MoH issue date.

If I choose “Living Recipients” from the “Recipients” tab, I see this:

[Screenshot]

If this screen had all the information I needed, I could easily use the Chrome Scraper extension to grab the data. Unfortunately, I want more information than name, rank, organization and conflict. If I click on one of the entries, I can see that all the information I want is on each entry page.

[Screenshot]

So now we know that we want to grab a handful of data points from each living recipient's page.

Step 2. Collect the addresses of all the pages we want our scraper to go to (i.e., the pages of all the living recipients).

We can do this in a number of ways. Since there are only 75 living recipients, across three pages, we could easily use the Chrome Scraper Extension to grab the addresses (see this guide if you forget how to use it).

[Screenshot]

Since I am using this project as practice for grabbing data from the pages of all 3,463 recipients, I decided to write a scraper in OutWit Hub to grab the addresses.

To write a scraper, I need to tell the program exactly what information I want to grab. I start this process by looking at the coding around the items I want using the “Inspect Element” function in Google Chrome.

[Screenshot]

If I right-mouse click on the “view” link and click “Inspect Element,” I will see that this is the line of code that relates to the link:

<div class="floatElement recipientView"><a href="http://www.cmohs.org/recipient-detail/3219/baca-john-p.php">view</a></div>

This line of code is all stuff we have seen before. It is just a <div> tag with an <a> tag inside it. The <div> is used to apply a class (i.e., floatElement recipientView) and the <a> inserts the link. The class is unique to the links we want to grab, so we can use that in our scraping. We just need to tell OutWit Hub to grab the link found within any <div> tag of the recipientView class.

In OutWit Hub, we start by loading the page we want to scrape.

[Screenshot]

Then we want to start building our scraper by choosing "Scrapers." When we click into the scraper window, we will have to pick a name for our scraper. I chose "MoH Links." You will also see that the CMOHS website has flipped into a code view. We will enter the directions for our scraper in the lower half of the screen, where it says Description, Marker Before and Marker After.

[Screenshot]

We just need one bit of info, so our scraper is simple. I entered:

  • Description = Link
  • Marker before = recipientView"><a href="
  • Marker after = ">

You can then hit “Execute” and your scraper should grab the 25 addresses from the first page of living recipients. But remember, I don’t want the addresses from just the first page, but from all three pages.

To do this, I need to step back, get super meta, and create a list to make a list. If you go to the second page, it is easy to see how these pages are organized or named. Here is the address for the second page of recipients:

http://www.cmohs.org/living-recipients.php?p=2

Not shockingly, "p=2" in English means "page equals two." A list of all the addresses is simple to derive.

http://www.cmohs.org/living-recipients.php?p=1

http://www.cmohs.org/living-recipients.php?p=2

http://www.cmohs.org/living-recipients.php?p=3
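Typing three addresses by hand is easy enough, but if you ever scale this up to all 3,463 recipients, a few lines of JavaScript (run in your browser's console, for example) can build the list for you. This is just a sketch; it assumes the full recipient list uses the same ?p= pattern, and you would paste the output into your text file.

 // Sketch: generate the page addresses, assuming the same ?p= pattern holds.
 // Change totalPages to however many pages the site actually has.
 var totalPages = 3;
 var urls = [];
 for (var p = 1; p <= totalPages; p++) {
   urls.push("http://www.cmohs.org/living-recipients.php?p=" + p);
 }
 // One address per line, ready to save as a .txt file.
 console.log(urls.join("\n"));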

If you create this list as a simple text file (.txt), we can bring it into OutWit Hub and use our scraper on all of these pages. After I create the text file, I go to OutWit Hub, choose "File > Open" and select the text file. Next, select "Links" from the menu on the right-hand side of the screen. It should look like this:

[Screenshot]

Now, select all the links by using Command+A. Then right-mouse click and choose "Auto-Explore Pages > Fast Scrape > MoH Links (or whatever you named your scraper)." OutWit should pop out a table that looks sort of like this:

[Screenshot]

YOU JUST RAN YOUR FIRST SCRAPER!!!

Way to go!

Now just export these links. You can either right-mouse click and select "Export selection" or click "Catch" and then hit "Export." I usually export as an Excel file. We'll eventually have to turn this file into a text file, so we can bring it back into OutWit. For now, just export it and put it to the side.

Step 3. Create a scraper for the data we actually want.

We are going to start with "Inspect Element" again. Remember, we want to find unique identifiers related to each bit of information we want to grab. I went in and looked at each piece of information (e.g., Issue Date) and at the coding around it.

[Screenshot]

If you run through each of the bits of information we are grabbing, you start seeing a pattern in the way the information is coded and unique identifiers for each piece of information. For example, the code around the “Date of Issue” looks like this:

<div><span>Date of Issue:</span> 05/14/1970</div>

And it looks like that on every page I need to scrape. So I can enter the following information into a new OutWit scraper – I called this one MoH Data – in order to grab the date:

  • Description = IssueDate
  • Marker before = Issue:</span>
  • Marker after = </

OutWit will grab the date (i.e., 05/14/1970), which is everything between the end of the "Issue:</span>" marker and the "</" that closes the surrounding <div>.
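If it helps to see the marker idea in code, here is a rough JavaScript sketch of what "grab everything between the marker before and the marker after" means. This is only an illustration of the logic, not how OutWit Hub works internally.

 // Illustration only: pull out the text between a "marker before" and a "marker after."
 var html = '<div><span>Date of Issue:</span> 05/14/1970</div>';
 var markerBefore = 'Issue:</span>';
 var markerAfter = '</';

 var start = html.indexOf(markerBefore) + markerBefore.length;
 var end = html.indexOf(markerAfter, start);
 var issueDate = html.slice(start, end).trim();

 console.log(issueDate); // "05/14/1970"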

Just about every piece of information we want has a label associated with it, which makes it very easy to scrape. I just went through and created a line in OutWit for each piece of data I wanted, using the label as the marker before.

The only piece of information that doesn't have a label is the name. If you right-mouse click on it and choose "Inspect Element," you will see that it is surrounded by an <H4> tag. If you use the Find function (Command+F), you'll see that the name is the only item on the page wrapped in an <H4> tag. So we can tell OutWit to grab all the information between the <H4> tags, like so:

  • Description = Name
  • Marker before = <H4>
  • Marker after = </

[Screenshot]

Once I got my scraper done, I hit the “Execute” button to see if it worked. It did!

Step 4. Now I just need to tell OutWit where to use my new scraper.

Go back to the Excel file you created in Step 2. Copy the column of links and paste them into a new text file. Save this new text file. I called mine mohlinks2.txt.

[Screenshot]

Next we open up OutWit. Before we actually start scraping, we need to deal with a limitation of the free version of OutWit: you can only have one scraper assigned to a given web address. So we need to change "MoH Links" (our first scraper) so it is no longer associated with cmohs.org.

Open up "MoH Links" on the "Scrapers" page of OutWit. Below where it says "Apply If Page URL Contains" there is a box that contains "http://www.cmohs.org." Delete the address from that box and save the new "MoH Links" scraper. Now go into the "MoH Data" scraper and enter the cmohs address in the same box, save the scraper, and then close and reopen OutWit.

[Screenshot]

Next, open mohlinks2.txt. Select all the links (Command+A), right-mouse click and choose "Auto-Explore Pages > Fast Scrape > MoH Data (or whatever you named your scraper)." Slowly but surely OutWit Hub should go to each of the 75 pages in our links text file and grab the bits of information we told it to grab. Mine worked perfectly.

All you need to do now is export the data OutWit collected, and then you can go into Excel to start cleaning the data and pulling information from it.

Although this first one probably seemed a bit rough, over time you will get used to how information is structured on websites and how OutWit works.

Video Production – In-class Assignment

Today for the first half of class, we will run through the basic operation of a DSLR for video production, and then each of you (in groups) will shoot a short video. The video will be a companion piece to a story being produced by your organization about Land Grant Films, a documentary production house being created at the University of Tennessee.

The author of the print piece has provided you with the press release she received from the university.

Oct. 28, 2015

UT Professor Launches Philanthropic Documentary Brand

KNOXVILLE — The Medal of Honor Project, a collaboration between the University of Tennessee, Knoxville, School of Journalism and Electronic Media and the 2014 Medal of Honor Convention, sparked an interest in directing for a UT journalism professor.

Nick Geidner, assistant professor of journalism, is launching and directing a new project he’s calling Land Grant Films, a logical extension of the Medal of Honor project, also directed by Geidner. The project is based in the School of Journalism and Electronic Media.

"My goal with Land Grant Films is to provide students with real world documentary experience while getting them engaged in organizations and issues that affect the community," said Geidner.

The project will provide students with real-world experience in documentary storytelling and also give local non-profits video assets that can be used to raise awareness and funds for their cause.

Land Grant Films already has several projects in the works. They are working on films for several local organizations, including the Boys and Girls Club of the Tennessee Valley, Metropolitan Drug Commission, Tennessee Paracycle Open and Joy of Music School.

“Students involved in our films get to work on all aspects of the production, from running camera and conducting interviews to scripting and editing the film,” said Geidner. “It is an intense, hands-on experience that gets the students ready for a job in the video production field.”

For more information on Land Grant Films, visit www.landgrantfilms.org.

# # #

A producer has reached out to Geidner and he has agreed to an interview at his office. He is very, very important, so you will only have 15 minutes to interview him and shoot all the necessary b-roll.

I will post each group's raw video to Vimeo. From the raw video you will write a script.

Video storytelling

Boyd Huppert is unquestionably one of the best video storytellers in the business. As a matter of fact, just last week Huppert won two Murrow Awards (writing and feature story). Here is one of his most recent stories, part of his Land of 10,000 Stories series:

Then here is his Murrow Award-winning feature story (also part of the Land of 10,000 Stories series):

[Screenshot]

But broadcast journalism is not only good at feature stories. It can also be used for important investigative journalism. "Injured Heroes, Broken Promises" by KXAS and the Dallas Morning News investigates failures at the Warrior Transition Unit at Fort Hood. It won the 2014 SPJ Sigma Delta Chi Award and the 2015 Murrow Award for investigative journalism.

BuzzFeed founder’s email about NBCUniversal investment

Here’s the full email Jonah Peretti, BuzzFeed’s founder and CEO, sent to BuzzFeed staff about the NBCUniversal investment.

Hello BuzzFeeders, 
I’m very excited to share that NBCUniversal has agreed to invest $200 million in BuzzFeed and partner with us to extend our reach to TV and Film. NBCU is the home of the Today Show, Jurassic World, the Minions, the Olympics, Jimmy Fallon and much more and we are looking forward to collaborating with them on projects we’d never be able to do on our own.

  
We’ve also signed an agreement with Yahoo! JAPAN to launch BuzzFeed Japan as a joint venture based in Tokyo. Yahoo! JAPAN is the leading digital media company in a huge market, reaching almost all of the online population in Japan. Partnering with them allows us to grow much more quickly in Japan than if we launched on our own. You can read more about our strategy from Greg’s blog post.
Additionally, we’ve executed a series of partnerships with the leading digital platforms, including Facebook’s Instant Articles, Snapchat’s Discover, Apple’s forthcoming News app, with more to come. These partnerships allow us to reach a bigger audience and have a bigger impact than what would be possible on our own. 
All these deals were structured to assure BuzzFeed’s continued editorial and creative independence. Equally important, the investment from NBCU and our rapidly growing revenue assures our financial independence, allowing us to grow and invest without pressure to chase short term revenue or rush an IPO. Our independence and a long term focus align us with our readers and viewers and help us deliver the best possible service for our audience. 
This is also great news for all BuzzFeed employees. There will be many opportunities for career development and growth as we expand in new areas and take on new challenges. Your work will have a bigger impact than ever before, spreading to more countries, across more platforms, in more formats. As a result of these deals, the work you are doing will play a bigger role in the lives of an even larger, more diverse, global audience. 
I’m sure you have lots of questions and I encourage you to submit them anonymously here. Today at 12p EST we’ll have a Global All Hands where I’ll answer your submitted questions and will be able to take live questions in NYC. Tonight at 7:30pm EST, I will do an all hands for the Sydney office. You should have a calendar invite with all the necessary information. My team and I will also answer questions in the new Slack channel #AMA at 3pm EST if you have more questions. If you don’t have Slack, click HERE to sign up and email helpdesk with any problems. 
One final point that is very important. None of this would be possible without the amazing work that all of you have done building BuzzFeed. Your inspired work in news and entertainment, tech and product, business and sales, across the U.S. and in countries around the world, has resulted in something truly remarkable and unexpected. So thank you and I can’t wait to be surprised and amazed by what you create next. 
Now let’s go have some fun! 
Jonah

Via @chrisgeidner

Deadline online graphic

As a class (or group, depending on how many people show up), you must create an interactive graphic to go along with a story about income disparities in America.

The print article is not finished yet, but the author sent you over the following info:

  • the article draws heavily from this report by the U.S. Census Bureau
  • the author talks generally about disparities, but also focuses on gender and race disparities and how these have changed over time

Your editor wants to post this as a quick web story in the next couple of hours. The graphics editor has suggested you use Juxtapose.JS as a novel way to demonstrate income disparities.

You have the rest of class to create something and post it to GitHub.

HINTS:

  1. Juxtapose needs two similarly sized images.
  2. You can host an image on Google Drive by (1) uploading it, (2) setting the privacy option to "anyone on the web," and (3) using the following address: https://drive.google.com/uc?id=0B2McuVJ6osMBaWc2dTEyWFBYU1U. Just replace the last crazy number with the file ID for your image.
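For what the finished embed might look like, here is a rough HTML sketch of a Juxtapose slider. I am writing the include paths, class name and data-label attributes from memory, so double-check them against the Juxtapose documentation; the image addresses just follow the Google Drive pattern from hint 2, with placeholder file IDs and labels.

 <!-- Sketch of a Juxtapose before/after slider; verify the include URLs against the docs -->
 <link rel="stylesheet" href="https://cdn.knightlab.com/libs/juxtapose/latest/css/juxtapose.css">
 <script src="https://cdn.knightlab.com/libs/juxtapose/latest/js/juxtapose.min.js"></script>

 <div class="juxtapose">
   <!-- Replace FIRST_FILE_ID / SECOND_FILE_ID with your Google Drive file IDs -->
   <img src="https://drive.google.com/uc?id=FIRST_FILE_ID" data-label="First chart">
   <img src="https://drive.google.com/uc?id=SECOND_FILE_ID" data-label="Second chart">
 </div>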

Homework for Tuesday, April 14

For Tuesday you will recreate this webpage. You will use the following:

  1. The U.S. Department of Education’s Equity in Athletics dataset
  2. Excel – to clean and prep the data
  3. Refine or the Tableau Excel Plugin (PC only) – to format the data
  4. Tableau Public – to build the graphics
  5. Text editor – to code the webpage
  6. GitHub – to host the webpage

I have provided hints for each step in the process below. Try to do it yourself, but if you really get stuck use the hints.

[Screenshot]

Step 1 Hint

Think of what you need to complete the graphic.

  • Data from 2003 to 2013 on every sport (male and female) for a single school (i.e., UTK)

Now that you know what you need, you should be able to find it pretty easily using the "Download selected data" tool on the Equity in Athletics page.

Step 2 Hint

Again ask yourself, “What do I need?”

We just need the data for each year for each sport, so we can dump a lot of the data in the default dataset, like all the sports UT doesn't have, the totals for each sport and all of Row 1.

Once we get rid of all the extraneous data, we just need to do one extra thing: transpose the rows and columns. We need to do this as step one of the process of formatting the data for Tableau.

Here are instructions on transposing data in Excel.

Extra hints: (1) We are still missing one piece of data. We need a variable for gender, so we can color code the boxes. It can just be a simple binary variable (i.e., 1 and 0). (2) We should also shorten the names of each sport in Excel. It is much easier to do it here than in Tableau or Refine. So instead of “Baseball Men’s Team Expenses” cut it down to “Baseball.”
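To make the target concrete, here is a sketch of the shape the data should end up in before it goes into Tableau: one row per sport per year, with the gender variable added. The column names and the 1/0 coding here are just my own placeholders, so use whatever labels the video suggests.

 Sport       Year   Gender   Expenses
 Baseball    2003   1        (dollar amount)
 Baseball    2004   1        (dollar amount)
 ...
 Volleyball  2013   0        (dollar amount)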

Step 3 and Step 4 Hint

This video will walk you through the process of designing the graphic in Tableau Public. It will also explain how to format the data using the Tableau Excel Plugin, which is PC only. If you are using a Mac, you will have to use Google Refine to reshape the data. Here is where you can download Refine, and here is how to use it to reshape the data into the format we need.

Step 5 Hint

We need to build a simple website with three elements.

  1. A headline
  2. Body text
  3. Graphic

The graphic is easy. We use the embed code from Tableau. The headline and body text are also pretty simple. Remember, we can add style to any tag. So for the headline, we can just wrap it in a <div> tag with an inline style, like so:

<div style="font-family:Georgia;font-size:300%">The Ever Growing UT Athletics Budget</div>

Then we can do the same thing with the body copy, but changing the style applied.
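For the body copy, that might look something like this (the font and size here are just placeholders; pick whatever matches your story):

<div style="font-family:Georgia;font-size:100%">Your body text goes here.</div>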

Step 6 Hint

Remember, to create a project page on GitHub, all we have to do is create a “gh-pages” branch and then add an “index.html” to that branch.

Here are instructions from GitHub.

 

Homework

For Thursday, I want you to adapt the code we looked at in class today. Use this code as a basis. Here are the things I want you to do.

1) Add three more data points. Make up the data. Play with numbers much higher than our current scale.

2) Change the color of the axes. Make them red.

3) Make the radius of the circles scale based on the number of Pulitzer winners and finalists.

 

Using D3.js – part 2

 

Last week, we made a simple interactive bar chart using the JavaScript library D3.js. Today, we are going to build on this by making a simple scatterplot and adding axes. We will build this from the beginning to review all steps of the process.

First, we need our data. We will be using some data compiled by Nate Silver's 538. In his story "Do Pulitzers Help Newspapers Keep Readers," Silver uses a scatterplot to look at the relationship between the number of Pulitzer winners and finalists and newspapers' change in circulation during the last 10 years.

We are going to pull their data and look at it in a slightly different way. Instead of looking at the change in circulation, we are just going to look at current circulation and compare it to the number of Pulitzer winners and finalists during the last 10 years. I guess our headline would be "Do the Biggest Newspapers Win the Most Pulitzers?"

We can grab the data they compiled from their GitHub site. We could actually just tell D3 to read a CSV file and point it to the raw file hosted on Github, but that adds a few extra steps beyond the scope of this example.

Regardless, I have pulled together the data we need. Remember, we start by creating a variable called "dataset" and filling it with our data. For the bar graph, we used an array: [value 1, value 2]. This time we will use an array of arrays (see below). In the first row, we create the variable and open the array. In the second row, we enter another array with two values, 20 and 2.5. Twenty is the number of Pulitzer winners and finalists this organization has had and is the value on the x-axis; 2.5 is the circulation (in millions) and is the value on the y-axis.

var dataset = [
 [20, 2.5],
 [62, 1.9],
 [1, 1.7],
 [41, 0.7],
 [2, 0.6],
 [2, 0.5],
 [0, 0.5],
 [48, 0.5],
 [1, 0.4],
 [8, 0.4],
 [15, 0.4],
 [6, 0.4],
 [6, 0.4],
 [3, 0.4],
 [2, 0.3],
 [6, 0.3],
 [11, 0.3],
 [7, 0.3],
 [8, 0.3],
 [4, 0.3],
 [2, 0.3],
 [2, 0.2],
 [15, 0.2],
 [5, 0.2],
 [5, 0.2],
 ];

Next, we need to define the size of our graphic. This is as simple as creating a variable for the width and the height.

var w = 700;
var h = 500;

Now we can create the graphic. First, we need to create the wrapper for the graphic (this is the space on the page where the graphic will be inserted).

var svg = d3.select("body")
 .append("svg")
 .attr("width", w)
 .attr("height", h);

Basically, the above code says, “Use D3 to append an ‘SVG’ or Scalable Vector Graphic to the body and make it ‘w’ width and ‘h’ height.” Now we just need to tell D3 what to put in that space.

svg.selectAll("circle")
 .data(dataset)
 .enter()
 .append("circle")
 .attr("cx", function(d){
 return d[0];
 })
 .attr("cy", function(d) {
 return d[1];
 })
 .attr("r", 5);

This code starts like the last example by appending a circle for each item in our dataset. Then it adds some attributes to the circles with the ".attr" method.

The first attribute is "cx," or the position on the x-axis. We pull the value for this attribute by using the magical "d" variable. Remember, the "d" variable, when passed into a function, will automatically cycle through our whole dataset, creating a value for each item. In our last example we used it to define the height of each bar. Now we are using it to define the position on the x-axis. The value the function returns is d[0], or the value in the zero position – which is the first number – for each item in our dataset.

Next we do the same thing for the “cy” attribute. Except we return d[1] or the second number in our array. Finally, we must define the radius of our circle. For now we will just give each circle a radius of 5 px.

We can now see our amazing graphic.

WHAT?!?! What happened?

Here’s the code. What is going wrong?


Did you figure it out?

Yep! It's that we need to scale the variables! Unscaled, "cx" only runs from 0 to 62 pixels and "cy" from 0.2 to 2.5 pixels, so every circle is crammed into the top-left corner of our 700-by-500 space. Remember, in the bar chart we used the following code to scale up the height.

.style("height", function(d) {
	var barHeight = d * 5;
	return barHeight + "px";
});

We multiplied the data by 5 to make the bars tall enough for the page. This code worked, but we can also have D3 do this for us. We should have D3 do it for three reasons:

  1. It works better for axes (which we'll be adding soon enough)
  2. It's easier. When we have a lot of data, it's not always easy to guess what you should multiply by to make it look nice.
  3. It's adaptable. If we end up needing to add data to our chart, it might throw off our manual scaling. If we automate it, D3 will do it all for us.

To automate the scaling, all we need to do is write a function that works out the scale for us and then change how the "cx" and "cy" attributes are set. We'll start with the function:

 var xScale = d3.scale.linear()
   .domain([0, d3.max(
     dataset, function(d){
       return d[0]; 
     })])
   .range([0, w]);

What we're doing here is assigning a function to the variable "xScale." The function figures out the correct scale in order to fit all of our data in the space we allotted in the wrapper (i.e., 700 x 500 pixels). It does this by setting a domain and a range. The domain is the values in our data; the range is the values of our output. In our example, the domain is 0 to 62 and the range is 0 to 700.

The above code calls the linear scale function built into D3. Then we assign the domain and the range. The domain is set to run from 0 to the maximum value in our dataset, and the range is set to run from 0 to the width of our space. Our data only goes up to 62, but if we all of a sudden need to add a 100 to the data, D3, using the max function, will find that value and adjust the scale properly.
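To make the mapping concrete, once xScale is defined you could check a few values in the browser console. This is just a quick sanity check (output rounded):

 console.log(xScale(0));   // 0
 console.log(xScale(62));  // 700
 console.log(xScale(20));  // 20 / 62 * 700, roughly 226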

We do the same thing for the y-axis.

 var yScale = d3.scale.linear()
   .domain([0, d3.max(
     dataset, function(d){
       return d[1];
     })])
   .range([h, 0]);

Then all we need to do is update the attributes in the code that actually draws the graphic.

 .attr("cx", function(d){
    return xScale(d[0]); 
 })
 .attr("cy", function(d) {
    return yScale(d[1]);
 })

You'll notice that this isn't much different than the last time around. The only change is that instead of directly returning the value from our dataset, we run it through the scale function we created so it is properly scaled.

We can now see our updated amazing graphic.

What is wrong now?? And why is it happening?

Here is the code.


Did you figure it out?

You're right…our circles on the edge are getting cut off. That's because they are falling outside of the space we set up for our wrapper. For example, one of our data points has a value of zero. That means the center of the circle is on the "0" line, which is the very edge of the wrapper. So half the circle falls outside the wrapper.

We fix that by putting a little padding around the chart. We start by adding a variable to define the width of the padding. We don’t need much.

var padding = 30;

Then we change the scaling functions to incorporate the padding. Specifically, we need to change the range.

 var xScale = d3.scale.linear()
   .domain([0, d3.max(
     dataset, function(d){
       return d[0];
     })])
   .range([padding, w - padding]);

All that I changed was the last line of code. It was "0" and "w" or 700. Now it is "padding" or 30 and "w - padding" or 700 - 30, which is 670.

You do the same thing for the yScale variable. You can see what it ends up looking like here. Here is the current full code.
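In case it helps, here is what that same change looks like for yScale. This is just a sketch mirroring the xScale edit; note the range stays flipped so larger values sit higher on the chart.

 var yScale = d3.scale.linear()
   .domain([0, d3.max(
     dataset, function(d){
       return d[1];
     })])
   .range([h - padding, padding]);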

It really is starting to look nice, but we need to add some axes to this chart.

First, we start by creating a variable that calls D3's axis function, like so:

 var xAxis = d3.svg.axis()
   .scale(xScale)
   .orient("bottom")
   .ticks(5);
 
 var yAxis = d3.svg.axis()
   .scale(yScale)
   .orient("left")
   .ticks(5);

So we call the function, then tell it to use the scales we already set up, orient the axes on the bottom and left, respectively, and break the graphic up into five ticks (major units, in Excel parlance).

Next, we have to actually draw the lines onto the SVG we already created.

 svg.append("g")
 .attr("transform", "translate(0," + (h - padding) + ")")
 .call(xAxis);
 
 svg.append("g")
 .attr("transform", "translate(" + padding + ", 0)")
 .call(yAxis);

With the above code, we are adding to our SVG. Unlike the "circle" or "bar" elements we have already drawn, we are now using the "g" element to draw a whole group of things (e.g., lines, text). Then we use the call function to pull in the axis info we already defined. Finally, we add a "transform" attribute to move the axes to where we want them: the x-axis gets pushed down to y = h - padding (470 pixels) and the y-axis gets pushed right to x = padding (30 pixels). It should look like this now:

[Screenshot]

Next, we can style the axes using CSS and by adding an attribute. The styling is just like styling the bars in the example from Thursday.

<style type="text/css">
 
 .axis path, 
 .axis line {
 fill: none;
 stroke: black; 
 shape-rendering: crispEdges;
 }
 
 .axis text {
 font-family: sans-serif;
 font-size: 11px;
 }
 </style>

Then we apply it via another attribute.

.attr("class", "axis")
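That attribute goes on the "g" groups we created for the axes, so the full calls end up looking something like this (a sketch, with the class attribute added to the code from above):

 svg.append("g")
   .attr("class", "axis")
   .attr("transform", "translate(0," + (h - padding) + ")")
   .call(xAxis);

 svg.append("g")
   .attr("class", "axis")
   .attr("transform", "translate(" + padding + ", 0)")
   .call(yAxis);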

Finally, we have a good looking graphic with axes. Anything weird?

Here is the current code.

Using D3.js

D3 (or Data-Driven Documents) is an open-source JavaScript library for making data visualizations. Pretty cool, eh?

Oh…you’re asking yourself, “what is an open-source JavaScript library?” Well, the first part, open source, means that the source code is publicly available. In many cases, open-source software can be freely downloaded, edited and re-published. For more information about open-source software, check out this annotated guide.

The second part, JavaScript library, means that it is a collection of JavaScript functions that you can reference in your code. Basically, it is a bunch of code that other people wrote so you don't have to! All you have to do is point to the library and tell it what you want to do.

Pointing to the library is easy. You just use the <script> tag and tell the browser what script you want. Generally, you can either host the library on your server or point to the library on the creator's server. If you point to their server, you'll automatically get updates (depending on their naming/workflow), which is good and bad. It is good in that you are using the newest software. It is bad in that they might update something in a way that ruins your project. I personally lean toward hosting on my server.

To host on your server:

  1. Download the library from the D3 website.
  2. Upload the library to your server
  3. Point to the library using the following code:
<script src="/d3/d3.v3.min.js" charset="utf-8"></script>

To leave it on their server:

  1. Just insert this code:
<script src="http://d3js.org/d3.v3.min.js" charset="utf-8"></script>

We have successfully navigated step one. Our browser will know that it needs to use JavaScript, because of the <script> tag, and it will load the correct JavaScript library, because we told it where the library is by using the src attribute.

Now we can move to step 2: actually making a graphic using the library. To do this, we can just put directions in between the opening and closing <script> tags (which sounds easy).

The first thing we have to understand about D3 is that we are using code to create everything in the chart. This is amazing, because it is highly adaptable and lightweight. It is also a drawback, because it means there is a steep learning curve and it can be a bit daunting at the beginning. Let's start by looking at a chart and breaking it down into its various elements.

Medal of Honor recipient origin: Top 5 states


What do we need to create this graphic?

  1. Data
  2. Type of chart
  3. Axes and labels
  4. Color coding

We are going to need to explain that all to D3.

First, let's deal with the data. You can get data into D3 in numerous ways. For now we will enter the numbers in an array and assign it to a variable. You can also point D3 to CSV files, JSON data and numerous other file types. I haven't looked, but I assume you could point to a Google Spreadsheet. Regardless, here is the snippet of code we'll use to encode our data:

 var dataset = [ 12, 15, 20, 24, 25, 18, 27, 29];

This code should make sense. We are creating a variable (var) named dataset and we are assigning our values to it.

Now we need to decide the way in which we want to display the data. For now, we will create a simple bar chart. So we need to style a bar. To do this we are going to use Cascading Style Sheets (CSS). Like a newspaper's style guide, CSS is used to format content on a given page or across a site. So you use HTML to enter the content and then CSS to style said content. We are going to assign the style to a <div> tag. We'll add the class "bar," so it isn't applied to all <div>s on our page. Here is the snippet of code:

div.bar {
 display: inline-block;
 width: 20px;
 height: 75px;  
 margin-right: 2px;
 background-color: teal;
 }

This will make the default bar 20px wide and teal, with a 2px right margin. Right now, the bar is 75px tall, but we will adjust that based on our data.

Finally, we need to tell our browser that we want to use D3 to apply this style and draw a bunch of bars representing our data. Here is the code we'll use to do that:

 d3.select("body").selectAll("div")
 .data(dataset)
 .enter()
 .append("div")
 .attr("class", "bar")
 .style("height", function(d) {
 var barHeight = d * 5;
 return barHeight + "px";
 });

OK…this snippet of code looks a lot more confusing. In English, this code says, "For each item in our dataset, append a div of the class bar and adjust the height of that bar based on the item's value."

One of the coolest things about D3 is using the built-in "d" variable to cycle through all the values in a dataset. In our case, D3 pulls up each value, multiplies it by 5 and assigns the result to the height of the bar it is drawing. For the first value, 12, that works out to a bar 60px tall.

Now we have all the building blocks for a basic bar chart. We can organize it in an HTML file as follows:

<html lang="en">
 <head>
 <meta charset="utf-8">
 <title>D3 Demo: Making a bar chart with divs</title>
 <script type="text/javascript" src="../d3/d3.v2.js"></script>
 <style type="text/css">
 
 div.bar {
 display: inline-block;
 width: 20px;
 height: 75px;
 margin-right: 2px;
 background-color: teal;
 }
 
 </style>
 </head>
 <body>
 <script type="text/javascript">
var dataset = [ 12, 15, 20, 24, 25, 18, 27, 29 ];
 
 d3.select("body").selectAll("div")
 .data(dataset)
 .enter()
 .append("div")
 .attr("class", "bar")
 .style("height", function(d) {
 var barHeight = d * 5;
 return barHeight + "px";
 });
 
 </script>
 </body>
</html>

If we uploaded that file, we would get the following chart:

[Screenshot]

Maybe it isn’t the most beautiful chart, but it is all code…no JPGs, no Google Charts…just code.

ED NOTE: I am not sure how long this will take in class, so I am skipping ahead to updating the dataset. I will come back to axes and labels. 

A code-driven chart is cool, but an interactive chart is even cooler. So let’s do that.

What we’ll have to do is add an object with which the user can interact (i.e., click). Then we’ll have to add code that tells D3 to listen for a click and update the data when it hears it. For the object, we’ll just create a simple set of text using the <p> tag. Here is the code we’ll use:

<p> Conference standing </p>

Now we need to add the Event Listener and tell it to update the data. Here is the code:

d3.select("p")
 .on("click", function() {

//New values for dataset

dataset = [ 7, 3, 4, 2, 2, 3, 2, 1 ];

//Update all bars
d3.selectAll("div")
  .data(dataset)
  .style("height", function(d) {
      var barHeight = d * 5;
      return barHeight + "px";
  });
});

Although this looks complex, we can easily walk through it. We are telling the browser to listen for any clicks within a <p> tag. Then once it hears the click, it executes the function. Within the function, the dataset is updated with our new data and the bars are redrawn.

You can see the fruits of our labor here.

Pretty cool, but pretty useless. Am I right?

We can easily make this better by adding an IF command to our Event Listener. You should remember IF commands from some of our work in Excel. But basically an IF command says:

IF (logical statement comes back true) {
     Do this
}
ELSE {
     Do something else
}

We can start this process by giving our user two interaction options, like so:

 <p id="wins">Wins per year</p>
 <p id="conf">Conference</p>

We do the same thing as earlier – use the <p> tag – but this time we add unique IDs that we can reference later.

Then we just add the IF command to our Event Listener:

d3.selectAll("p")
 .on("click", function() {

 //See which p was clicked
 var paragraphID = d3.select(this).attr("id");
 
 //Decide what to do 
 if (paragraphID == "wins") {
   //New values for dataset
   dataset = [ 12, 15, 20, 24, 25, 18, 27, 29 ];
   //Update all bars
   d3.selectAll("div")
     .data(dataset)
     .style("height", function(d) {
        var barHeight = d * 5;
        return barHeight + "px";
     });
 } else {
   //New values for dataset
   dataset = [ 7, 3, 4, 2, 2, 3, 2, 1 ];
   //Update all bars
   d3.selectAll("div")
     .data(dataset)
     .style("height", function(d) {
        var barHeight = d * 5;
        return barHeight + "px";
      });
   }
 });

All we added was the two options. If the user clicks "Wins per year," we reload the original dataset, and when the user clicks "Conference" we insert the new dataset.

You can see the chart here.