This is going to be one of the more unusual HOWTOs I’m likely ever to produce for this site. It all started earlier today when I asked myself a seemingly simple question:
What is the average weight of famous celebrities that are roughly my height and age?
I was curious because I was wondering how far off the mark I currently was in terms of weight compared to what pop culture considers attractive.
In this day and age, finding the stats on celebrities is frighteningly simple. Aggregating those stats into something meaningful to my question however proved to be more complex.
First and foremost, I’d like to give full credit to the website with which the following information is based on:
With that out of the way, let’s have a look at their site:
It looks nice enough. The vital statistics of over 400 celebrities are listed here. However, filtering it to get the information I wanted was impossible.
The only option then would be to extract and compile the data myself. Let’s begin shall we?
- Right click on the webpage and view source
- Press Control-A to capture all of the data and paste it into notepad
- Delete everything prior to the first reference for DATE OF BIRTH for the first person listed on the website (At the time of writing, that is Julie Andrews)
Use the find and replace feature in Notepad (or better yet notepad2 as it’s a lot faster) and make the following replacements:
- Replace </div> </td> with blank
- Replace <td> <div class=”okvir”> with blank
- Replace <div class=”okvir1″> with blank
- Replace <span class=”ttt”> with blank
- Replace </span> with blank
- Replace <br /> with blank
Save your changes
- Now is where the real magic happens. The file contains a lot of HTML formatting garbage that we simply don’t want. To extract only what we do want, open a Powershell console
- Run the following command:
get-content .\celebweight2.txt | select-string -pattern “DATE OF BIRTH:” -context 0,10 | clip.exe
- What this does is reads in the text file we created earlier and then uses the select cmdlet to grab not only the line that we search for, but also the next 10 lines after a match is found
- Lastly we pipe the results straight to the clipboard which saves us having to write it to a temporary file
- Open a new instance of notepad and paste in what’s in the clipboard and save the file
- Open Excel, and open the file you just saved above
- Accept default detected column formatting for import
- You should get something that looks like the following:
- That’s great but we want to make a table out of it so we still have some more work to do
- We need to build a relationship that says that every 5 rows belongs to the same “object”
- To do that, fill in the number 1 in each of the first 5 rows (that represent Juile Andrews)
- In the 6th row, enter the formula =IF(A6=”DOB”,C5+1,C5) and fill down for the remaining entries
- This says only increment when you see the DOB entry which satisfies our requirement
- Select all your data and copy everything and paste special into a new sheet, selecting Values so formulas are removed
- Select all of the columns and Sort by Column A (our “categories”) and by Column C (our unique object identifier value)
- Manually move each category of data into its own matching columns as shown below
- Create column headers and remove unnecessary columns
- At this point, we have a table that is starting to look like what we’re looking for. But to make things simpler for comparison, we need the age of each celebrity
- Select the DOB column and use the Replace function in Excel to remove the periods. We will remove all of them which is acceptable for our needs
- Next, we need to subtract the current year from the extracted year of their birth to obtain their age using the formula
- Next, because I live in a metric speaking country, I want to remove the imperial measurements. To do this, select the weight column and use the Text to Columns function under the Data tab
- Select Delimited and use the “(” as your delimiter. This will move the kg weight into its own column
- Use the Replace function on the new weight column to remove the other bracket
- Repeat this process for the height column
- Repeat this process as well to remove the cm and kg suffixes as their presence means the cell will not be a number type and so we won’t be able to filter it
- At this point, we now finally have a table that is nearly ready to answer my initial question
- But as you recall, the question involved comparing celebrities with vital statistics similar to my own
- To filter the list, select all columns and then press the Filter button under the data tab
- Select the age filter and choose Number Filters / Between
- Enter the ages of 29 to 33 which is the range I’m interested in
- Repeat this process for height using the values 178 to 184 (kg)
- We’re really close now. The only problem left with this list is that includes a number of women which obvious is unnecessary to answer my question
- Since I wasn’t familiar with most of these people, a quick type of their name into google and let google autocomplete automatically show me their picture allowed me to determine gender and remove all of the women
There you have it! Out of 408 celebrities listed on the site, 15 of them match my gender, age and height. Of these 15 people, the average weight is…
If we graph that out (because, hell we can), we see that some guy named Henry Cavill is really the only outlier.
He must be some fat dude then, right? Actually, it turns out that’s the guy from the new superman movies so he has unusually more muscle than even the rest of the people on this list.
Well, that’s 2 hours of my life I’ll never get back. But now I know. J