Welcome to the thrilling conclusion of last week's research!
I said last week that some real-life stuff had gotten in the way of finishing off this project, and now I can talk about exactly what that was.
It was actually the project that got in the way of the project.
Now that may not make much sense, but when we go back over exactly what we were going to do, it may.
We had a list of about 14,000 malware-related domains, and the plan was to make a GET request to each of them, 10 times each. That sounds fine, but if you boil it down a bit, “we are going to make 140,000 GET requests to malware-related domains” actually sounds a bit foolish.
Well, my ISP agreed with that sentiment, and took my normal 250 Mbps line down to 0.5 Mbps. I got rate-limited.
A quick call to their abuse hotline, and they confirmed that it might have been a bad idea by saying “You might have a virus on your computer.”
It also turns out that the throttling itself isn't automatic. They have some sort of automated detection system, and that generates an alert for an abuse-desk worker. The worker then goes to the customer's file, checks whether there is any relevant info, and then either enacts a block or rate limit of some kind, or does nothing.
When I phoned up, they were extremely understanding. They assured me that within a minute or two I would get my full connection speed back, and that we could put a note in my account details so that the next person who checks them knows all this malware-related traffic is actually legitimate research.
So with that out of the way, and now that I could run the script as many times as I wanted, I figured I should definitely do it more than once!
So I did.
Not a huge number of times, though; we should be wary that each time we do this, someone, somewhere in a call center in Stockholm, likely has to go through my file and justify the false positive.
So if we take a look at the script we made, we can see how it should output its results. I've included it here because I can't remember.
import re
import sys
import threading

import requests

urls = []
threads = []

def worker(url):
    """This function does the important work"""
    for i in range(1, 11):  # enumerate author IDs 1 through 10
        req = requests.get(url + "/?author=%d" % i, timeout=3)
        m = re.search('(author author-).*(author-)', req.text)
        if m:
            print("Found \t %s from Url \t %s" % (m.group(0)[14:-8], url))

with open(sys.argv[1], 'r') as file_object:
    for line in file_object:
        if line.strip('\n') not in urls:
            urls.append(line.strip('\n'))

for item in urls:
    t = threading.Thread(target=worker, args=(item,))
    threads.append(t)
    t.start()
So if we look at the print statement in the worker function, we can see the output should be “Found”, then a tab character, then the login name, then “from Url”, then another tab character, and then the URL we are testing.
As we can see here in this ‘confidentiality enhanced’ screenshot, the formatting is a little off, but this is due to the sometimes extremely long usernames that people can have.
You can also see that I ran this 4 times, and each time there was a different number of results. This is pretty normal behavior when it comes to the internet. Domains change all the time, and maybe some of those WordPress installations get taken down, or cleaned up from hosting malware.
Now for the interesting part, frequency analysis!
I added all the files together into one big list of results; it was 3536 lines long. Now we can chop out all the duplicate lines by using “sort final | uniq | wc -l”. This command sorts all the lines in the file (lexicographic order by default), pipes that output into uniq to filter out the duplicates (uniq only collapses adjacent identical lines, which is why the sort comes first), and then pipes that into a count of the number of lines. I ran it again, chopped off the “wc -l” at the end, and redirected the output into another file I called “sorted”. Sorted.
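Here is the whole pipeline end to end, as a minimal sketch. The file names `final` and `sorted` match the ones above, but the three sample lines are made up for illustration:

```shell
# Hypothetical results file with one duplicated line.
printf 'Found\tadmin\thttp://a.example\n' >  final
printf 'Found\tbob\thttp://b.example\n'   >> final
printf 'Found\tadmin\thttp://a.example\n' >> final

# sort groups identical lines together; uniq then drops the repeats.
sort final | uniq | wc -l    # counts the unique lines: 2

# Same pipeline, but saved to a file instead of counted.
sort final | uniq > sorted
```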
The result was that we were left with 1181 lines. That is nearly 70% more than when we first ran this script and got back only around 700 lines!
So now let’s start looking at the results. The “cut” command is, in my opinion, one of the most useful commands for processing text files. We can take a file like “sorted” and cut out only the fields we want to use. We can also select what character the fields are separated by.
That means we can select the space character as the delimiter, and then we would get out all the words, or be able to show words based on how many spaces into a line they are. You may remember that we put tab characters (\t) into our results, and that the results looked a bit out of alignment when we viewed them. This was not an accident.
So if we give the command “cut -f2 sorted”, we will get back everything in field 2. Cut uses the tab character as the default field delimiter. The delimiter is just a fancy word for saying “What character should I cut on?”
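A quick sketch of both behaviors, using a hypothetical two-line `sorted` file where the username sits in field 2:

```shell
# Hypothetical tab-separated results; field 2 holds the username.
printf 'Found\tadmin\thttp://a.example\n' >  sorted
printf 'Found\tbob\thttp://b.example\n'   >> sorted

# Tab is cut's default delimiter, so -f2 prints the username column.
cut -f2 sorted                        # prints: admin, then bob

# -d swaps the delimiter, here to a space.
echo 'one two three' | cut -d' ' -f2  # prints: two
```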
We can also use this opportunity to do some amazing regex magic!
By issuing the command “cut -f2 sorted | grep admin | wc -l”, we get confirmation that internet security is terrible: 283 sites have “admin” in their login name. Now to be fair, that includes the quite ridiculous “hjydtobakinadmin_01_sygora7lacio”. If that were a typical user’s password, I think we would have way fewer than 14k domains to look at.
So around 25% of sites have the word admin somewhere in their name. Let’s use some regex to find out which ones are only admin.
“cut -f2 sorted | grep “^admin\b” | wc -l” will do most of the work for us, but it also catches the three occurrences of “admin-2” in the results. I need to learn how to regex better; after all, it is magic.
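The reason “admin-2” slips through is that \b matches a word boundary, and “-” is not a word character, so the boundary sits happily between “admin” and “-2”. Anchoring with $ instead requires the line to end right after “admin”. A sketch with a made-up three-line file (note that \b in grep is a GNU extension):

```shell
printf 'admin\nadmin-2\nadministrator\n' > names

# \b is a word boundary; '-' is a non-word character,
# so "admin-2" still matches. "administrator" does not,
# because 'i' is a word character and no boundary exists there.
grep -c '^admin\b' names    # 2

# $ anchors to end of line, so only the bare "admin" counts.
grep -c '^admin$' names     # 1
```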
So after subtracting those, we find that there are 217 domains with just “admin” as the username. That is 18%. That makes me sad. There are also five that have changed it to “administrator”.
Using the “uniq -c” command, we can prefix each login name with the number of times it is used in the file, and the results are pretty similar to what we might have expected.
Many different uses of admin. “Service” has 18 mentions, and oddly, a single “-” is the login name of choice for 13 people. “Test” is also popular, with 4 people feeling the same way about it as the 4 other individuals who chose “test-1”, “test-2”, and so on.
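The frequency count itself looks like this; a minimal sketch with a made-up column of login names (the counts above come from the real results file, not this sample):

```shell
# Hypothetical column of login names, one per line.
printf 'admin\nservice\nadmin\ntest\nadmin\n' > names

# uniq -c only collapses adjacent duplicates, so sort first;
# sort -rn then ranks the counts from most to least common.
sort names | uniq -c | sort -rn    # top line: "3 admin"
```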
Now apart from all this, there is some other interesting information we can get out of the domains themselves. Normally, if people are going to be installing malware, it happens by someone clicking a link in an email and being redirected to a site that hosts the actual malware.
To get people to click the links in emails, the scammers will generally try to create a driving force for you to click. This could take a number of forms, such as “Here is a bill that you have to pay” or “Hot sexy people in your area”.
As such, they will very often try to use a domain that, at a quick glance, could be legitimate, or at least is long enough not to fit in the URL bar of a phone screen. Smoke and mirrors.
So with that in mind, let’s take a look in our domain file for some interesting words.
Well, that is 36 domains that are most likely not legitimate in any sense of the word. If we search our results file for the word “google”, we find zero results. This jibes with what we had been assuming: domains are either set up on purpose to host malware, or they are not. If they are not, then they are probably hacked, and if they are hacked, there is a chance they are running WordPress with less-than-good credentials.
I ran a few more checks against this list for words I thought might be phishing/scamming related, and looked at how those results were echoed in our WordPress site URLs.
Facebook – 33
Paypal – 176
Microsoft – 13
Yahoo – 4
Twitter – 0
Instagram – 1
Snapchat – 0
Steam – 13
Security – 60
Amazon – 14
Hotmail – 1
Outlook – 1
Adobe – 8
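These checks are just a keyword count over the domain file. A minimal sketch, with a hypothetical three-line `domains.txt` standing in for the real 14k-domain list:

```shell
# Hypothetical domain list for illustration only.
printf 'paypal-secure-login.example\n' >  domains.txt
printf 'verify-paypal.example\n'       >> domains.txt
printf 'flowershop.example\n'          >> domains.txt

# grep -ci counts the lines matching each keyword, ignoring case.
for word in paypal facebook amazon; do
  printf '%s - %s\n' "$word" "$(grep -ci "$word" domains.txt)"
done
# prints: paypal - 2, facebook - 0, amazon - 0
```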
The results again align with what we have come to expect. Almost none of these words appear in our list. There are a couple that occur, but each time they do, it’s in a login name, and it’s an email provider, because the owner has obviously used their email address as the login name for the WordPress installation.
Something that also seems interesting is that there is very little in the way of more modern social media outlets in the list, like Twitter or Snapchat. I suppose the value of a hacked social media account is relatively low compared to the value of a hacked PayPal account.
I hope this post has convinced you to keep a strong set of credentials for whatever online services that you use. We can see how easy it is to get login names, and if the password is “Password1”, then it is going to get hacked.
Also, remember that if you are going to try something like this at home, which I would 100% encourage you to do on your quest for learning, please notify your ISP first. Not only does that give them a heads-up and save everyone a bunch of time when you try to get un-throttled, but they may also have something in their terms and conditions about it. Heck, it might even be illegal in your country.
I hope this blog has shown you the awesome power of threading that just a tiny bit of knowledge can bring!
Have an amazing week everyone! Next week we are going to try drilling holes in circuit boards!