
The H-Net Job Guide data set went through multiple iterations of cleaning and organizing before it was entered into Tableau. The final data set gave us the opportunity to explore the data's full potential for analyzing the academic job market.

 

The process of cleaning this data set was multi-faceted and meticulous. Many problems were apparent in the original data set at surface level, and several more emerged once we examined the details of individual columns and entries. In the beginning, each data entry was supported by at least ten columns of information within an Excel spreadsheet. These columns included the ID number of the job posting, the postdate, the institution, the type of institution, the state/province, the country, the department within the institution, a brief written description of the position, and a textual analysis providing additional information about the posting. There was also a string of columns, the first named Primary Category and the rest named Category 2, Category 3, and so on up to Category 35. These identified more specific areas of study for the position posted. From this basis, we narrowed the scope of our project by cleaning the data set.

 

First, we eliminated all category columns beyond Primary Category because entries trailed off sharply in the later columns, leaving more and more cells blank and making those columns unreliable for our data set. Our team then decided to focus on the variables of each entry that we had the most control over; that is, we looked for the columns that would best support our research question. To do this, we sorted the columns into what we’ll call organic and generated types of data. Organic data was any piece of information written solely by the poster; this included Institution, Department, Description, and Text. Of these, we concluded that only the Institution names would be necessary moving forward, as the other columns were too unique to each posting to be compared across entries. The remaining columns, the generated data, were those supplied by the H-Net Job Guide database itself; these included ID Number, Postdate, Job Type, Country, State, and Primary Category. These could be organized uniformly without fear of typos. We decided that all but ID Number were necessary to our data set and research questions.
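The column selection described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure we actually ran in Excel: the column labels (including "Type" for the institution-type column), the helper names `select_columns` and `blank_share`, and the two sample rows are all made up for the example.

```python
# Hypothetical column labels, modeled on the columns named in the text.
KEEP = ["Postdate", "Institution", "Type", "Country", "State", "Primary Category"]


def select_columns(rows, keep=KEEP):
    """Return new rows holding only the kept columns; the input is not modified."""
    return [{col: row.get(col, "") for col in keep} for row in rows]


def blank_share(rows, column):
    """Fraction of rows in which `column` is missing or empty."""
    if not rows:
        return 0.0
    blanks = sum(1 for row in rows if not (row.get(column) or "").strip())
    return blanks / len(rows)


# Invented sample rows, for illustration only.
sample = [
    {"ID": "1", "Postdate": "2015-01-02", "Institution": "Example University",
     "Type": "College / University", "Country": "US", "State": "NY",
     "Primary Category": "American History",
     "Category 2": "Urban History", "Category 3": ""},
    {"ID": "2", "Postdate": "2015-01-05", "Institution": "Sample College",
     "Type": "College / University", "Country": "US", "State": "OH",
     "Primary Category": "European History",
     "Category 2": "", "Category 3": ""},
]

# How empty are the later category columns? This is the check that
# motivated dropping everything beyond Primary Category.
later_blank = {c: blank_share(sample, c) for c in ("Category 2", "Category 3")}

# Drop ID, the extra category columns, and the free-text columns in one step.
trimmed = select_columns(sample)
```

Measuring the share of blank cells per column before dropping it makes the decision to keep only Primary Category something that can be checked rather than assumed.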

 

After the initial cleaning of our data set, we decided to include Postdate, Institution, Type, Country, State, and Primary Category in the final data set draft. From this point, we delved further into the data and discovered a slew of problems within the entries. We created an additional column labeled “Fixed Institution” to record our corrections without altering the data as received. We went row by row through the institution names to ensure that, when put into a visualization program, they would produce the most accurate results possible and remain true to the intended data. Common errors included misspelled institution names, articles that were not part of the institutions’ names, improper punctuation and capitalization, and (most difficult to correct) incorrect phrasing and ordering of university systems. Correcting the last of these required our own research into institutional structures, including how state universities, independent colleges, and satellite campuses are divided, followed by the organization of those structures into a proper and uniform phrasing. It was imperative that these structures be correctly identified; otherwise, our visualizations would not uphold the integrity of the data set.

Once the Fixed Institution column was completed, we created two more columns of derived data. The first was Regional Accreditation, which grouped the institutions geographically into broader multi-state regions across the United States. The second, which we labeled Academic Consortium, indicated institutional groupings (e.g., the Big Ten, the Ivy League). At this point we decided to limit our Type column to institutions categorized as “College / University” and to drop the numerically insignificant postings from other types of institutions, such as Government and Museum.
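As a rough sketch of the mechanical part of this step, the Python below adds a "Fixed Institution" value alongside the untouched original and then keeps only "College / University" rows. The function names, the specific rules, and the sample rows are assumptions made for illustration; they cover only whitespace, a leading article, and trailing punctuation. Spelling corrections and the ordering of university systems required manual research and are not automated here, and a real rule set would need exceptions for institutions whose official name begins with "The".

```python
import re


def fix_institution(name):
    """Apply a few simple, illustrative normalization rules to a name."""
    fixed = re.sub(r"\s+", " ", name).strip()                   # collapse stray whitespace
    fixed = re.sub(r"^the\s+", "", fixed, flags=re.IGNORECASE)  # drop a leading article
    return fixed.rstrip(".,;")                                  # drop trailing punctuation


def add_fixed_institution(rows):
    """Return copies of the rows with a 'Fixed Institution' column added.

    The original 'Institution' value is left exactly as received.
    """
    return [{**row, "Fixed Institution": fix_institution(row["Institution"])}
            for row in rows]


def keep_colleges(rows):
    """Keep only rows whose Type is 'College / University'."""
    return [row for row in rows if row.get("Type") == "College / University"]


# Invented sample rows, for illustration only.
rows = [
    {"Institution": "  the  Example State University. ", "Type": "College / University"},
    {"Institution": "Example Museum of History", "Type": "Museum"},
]

fixed = add_fixed_institution(rows)
colleges = keep_colleges(fixed)
```

Writing corrections to a separate column, instead of overwriting Institution, keeps every change traceable back to the entry as it was originally posted.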
After nearly three and a half months of editing the H-Net Job Guide data set, we were finally able to input the data into Tableau, create and analyze our visualizations, and draw our conclusions.
