Sorry for not having posted for a couple of weeks, but I have had a severe case of not finishing projects, and constantly starting new ones.
Fortunately today I have something to share with you after I forced myself to finish one of these projects, and it is something that I think has been on everyone’s minds.
Threading in Python.
Let’s step back a little, and go over why I got to this place.
During my day job, I spend a lot of time staring at the user interface for Burp Suite. As we have discussed previously, Burp Suite is a proxy that allows you to examine, modify and repeat HTTP requests (among other things).
It looks like this.
I would highly recommend that you download and play around with Burp no matter what your profession. Knowing what is going on under the hood of your browser, web application or micro-service is not only invaluable, but fascinating.
It will also show you every request that you make in the nice, sortable history tab you see above. I use Firefox when testing most of the time, as it does not have an XSS filter like Chrome does that blocks some of the more lame things that I find.
I once left it running over lunch, and when I came back, I found several requests that I had not specifically made. Most of them were to shavar.services.mozilla.com, which is a tracking protection service from the Mozilla Foundation.
There was another one, though, that really caught my eye: it grabbed a list of malware domains, presumably so that the browser can warn users before they visit one and get infected.
Obviously I downloaded the list of domains and sent it home for later study. This spurred me on to find more malware related domains that I could use for some form of research. The first list that I got from Burp was only 1219 lines long. Barely an exhaustive list of what must be thousands of domains on the internet that are either hacked and hosting malicious files, or purposefully set up as places to serve malware.
I looked around Google for a bit and found an open project where users share and curate a large list of malware related domains that are free to download and use. We immediately added the domains to our blocked list, and I felt like I had made good use of my day and helped increase the security posture of our organisation. I also had this larger list of 12701 domains to play around with too!
I didn’t know what I wanted to do with these domains, and it’s good that I sort of left them on the side for a while, as it turns out that I wanted to run the two TinyTools we developed in the last couple of posts against the domains.
Just to cover what those were: first there was a tool that grabbed usernames from WordPress sites, and second was a tool that would attempt a zone transfer against a site. Now neither of these things is an indicator of compromise (IOC), but my thinking is that malware domains fall into one of two camps: set up on purpose, or hacked. If they are set up on purpose, then we will likely not get much information out of them. If, however, they are hacked, then it is at least somewhat likely that we will be able to get some information out of them.
Like I said in the previous post, WordPress gets a bad rap, and infrequently updated, overly plugin-rich WordPress just gets hacked. An ill-configured name server that allows a zone transfer is not per se proof that a site has been hacked, but it seems like the two might be related. If you don’t care much about your name server, then it is unlikely that you care about updating your site, or about its security.
So in summary, what we are about to do proves nothing, but it is super interesting, and research can sometimes yield interesting results, even if you know that what you are doing is not going to prove anything new. As I write this, I am really hoping that we find something awesome, because the scripts have not yet finished, and I haven’t analysed the results.
So let’s start talking about threading.
When I grew up, the only interaction with programming I had was in the form of BBC BASIC, which has syntax a little like this.
10 PRINT "HUGO IS SUPER AWESOME!"
20 GOTO 10
We never really got past how awesome that was, so we didn’t really explore for/while loops or anything else like that. Now that I am playing around with Python and making small tools, I sometimes have a need to run them against large data-sets, such as the one we have been talking about.
If we remember from the WordPress user checking tool, we make up to ten web requests per site to determine, non-scientifically, if there are any usernames, and thus if it is a WordPress install. The total number of lines we have is 13920, so roughly 140,000 web requests, which will take some time if we make them one after the other, as we would if we just fed the domains to the tool. The same will probably happen with the zone transfer tool too.
What if we could make them all at the same time? Or perhaps not all at the same time, but in pretty short succession of each other. After all, if we sent out 140,000 web requests at once, each would have to come back to a separate ephemeral port on our computer, and we only have around 65,000 ports. We would DoS ourselves.
Not only do we need to space the requests a bit, but we also need to avoid melting our router. So this is where threading comes in. We can create a function called a “worker”. It doesn’t have to be called that, but it is the function that will do all the work, so it seems apt. Let’s keep that nomenclature for our script.
We will then make a list of threads and a list of urls. These will get populated with urls from our file (which still needs some additional sanitisation) and with threads. The worker function is pretty much just the WordPress username enumerator script, with a bit of printing to a specific format added in at the end, and a timeout to make sure the script doesn’t hang.
After that we load in a file that was given as an argument when the program was started, read the list of urls from it, and append them to our list so that we can use them later.
The magic for me, who had never really understood how to do threading before, happens in the final loop. In pseudo-code, this is what I wish someone had explained to me perhaps 8 years ago:
For each url in your list, make a threading object, which will perform the function you define as ‘target’, and if you want to feed it a parameter (such as a url), then you just need to put that in brackets afterwards and call it ‘args=’. Then add that threading object, all ready to go, to the end of the list of threading objects we are making, and start it.
The script looks like this.
import re, sys, threading
from time import sleep
import requests

urls = []
threads = []

def worker(url):
    """This function does the important work"""
    for i in range(1, 10):
        try:
            req = requests.get(url + "/?author=%d" % i, timeout=3)
            m = re.search('(author author-).*(author-)', req.text)
            if m:
                print "Found \t %s from Url \t %s" % (m.group(0)[14:-8], url)
        except requests.RequestException:
            pass

with open(sys.argv[1], 'r') as file_object:
    for line in file_object.readlines():
        if line.strip('\n') not in urls:
            urls.append(line.strip('\n'))

for item in urls:
    t = threading.Thread(target=worker, args=(item,))
    threads.append(t)
    t.start()
    sleep(0.1)
I really had a penny-drop moment when I understood this. I had made plenty of scripts that used threading before as I learned Python from books. I never truly understood what it was that I had to do to make it work, though. Now I want to thread all the things. If you don’t get filled with emotion and awe at this point, you may be a robot.
You may also be wondering why there is a sleep command in there. I put that in so that there would be a little gap between starting each thread, rather than kicking off every set of ten requests (which may ultimately time out) at virtually the same instant.
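A tidier way to pace things than a fixed sleep is to cap how many threads can actually run at once with a semaphore. This is just a sketch of the idea, not part of the original script; the cap of 50 and the example urls are arbitrary, and the sleep stands in for the real web requests:

```python
import threading
from time import sleep

MAX_CONCURRENT = 50          # cap chosen arbitrarily for illustration
pool = threading.BoundedSemaphore(MAX_CONCURRENT)
lock = threading.Lock()
results = []

def worker(url):
    with pool:               # blocks here while MAX_CONCURRENT workers are busy
        sleep(0.01)          # stand-in for the ten real web requests
        with lock:           # appending from many threads at once needs a lock
            results.append(url)

threads = []
for item in ["http://a.example", "http://b.example", "http://c.example"]:
    t = threading.Thread(target=worker, args=(item,))
    threads.append(t)
    t.start()

for t in threads:
    t.join()                 # wait for every worker to finish

print(len(results))
```

With this in place you can start all the threads immediately and let the semaphore do the queueing, instead of guessing at a sleep interval.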
Now we have our script ready, we have to prepare our list of domains a little bit. I tested out the first ten domains, and nothing worked. This turned out to be because they lack the http:// section of the url, so we need to add that to each line.
This would be a daunting task if it were not for three things: Google, Stack Overflow and sed.
First we concatenate the two files we had ‘domains’ and ‘domains2’ using:
cat domains domains2 >> final
I checked the file for duplicate lines with the sort and uniq command line tools (uniq only spots adjacent duplicates, so you need to sort first), but there were none. Next we need to get that ‘http://’ in front of every line, so after a few seconds with Google, we find that the syntax for prefixing each line in a file with something is
sed -e 's/^/prefix_goes_here/' file
We can also add a redirect at the end with ‘> file2’ to put all the results into a new file. The above sed instruction sure looks like magic to me, but after a bit of digging and fond remembrance of regular expressions, we can demystify it a bit.
Sed is being given a script command to perform by using the ‘-e’ flag. What follows is ‘s/regex/replacement/’. The ‘^’ character is the regex, and simply means “At the beginning of each line”. The replacement is exactly what it sounds like. It’s also useful to note that just trying
sed -e 's/^/http:///' final > final2
won’t work. Why? Because those ‘//’ at the end of the http:// are interpreted as part of the sed command, which uses ‘/’ as its delimiter. We need to escape them with backslashes.
sed -e 's/^/http:\/\//' final > final2
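If you would rather stay in Python than reach for sed, the same trick works with the re module. A small sketch, with a couple of made-up domains standing in for the lines of ‘final’: the ‘^’ anchor matches the start of each string, so the “replacement” is simply prepended.

```python
import re

# Stand-ins for lines of our `final` file
domains = ["example.com", "malware.example.org"]

# sed's 's/^/http:\/\//' in Python terms: ^ matches the start of the
# string, so substituting it just prepends the prefix
prefixed = [re.sub(r"^", "http://", d) for d in domains]
print(prefixed)
```

No delimiter escaping needed here, which is one small advantage over the sed version.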
Thus we get our file of 13920 domains, all formatted and ready to go into our WordPress script. I took the first ten domains using the head command and timed how long the script took against them, about 22 seconds, as a completely non-scientific way of finding out how long the whole file might take, even with our awesome threading ‘optimisations’.
Hrm. Given that there are 3600 seconds in an hour, then….
(1392 * 22)/3600 = 8.5 hours.
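The same back-of-the-envelope sum as code, assuming the ten-domain test took about 22 seconds:

```python
# 13920 domains, processed in batches of ten
batches = 13920 / 10.0
seconds = batches * 22   # ~22 seconds measured for ten domains
hours = seconds / 3600
print(round(hours, 1))
```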
Well, it’s a good thing I am going to leave this to run overnight. Let’s split the file into smaller, more manageable files of around 1500 lines each, and then we can make a for loop to run our script against each file.
There is a command line utility called ‘split’ which does exactly what it sounds like it should. We can feed it our file and ask it to turn it into files of however many lines we want, or bytes, or whatever. It will spit out these files and give them a specific naming convention too, which will really help us out.
split -l 1500 final2 xyz
This will give us a bunch of files with 1500 lines each, and the name xyzaa, xyzab, xyzac and so on. We can then use the following
for i in $(ls xyz*); do python threaded_wpusers.py $i; done >> results
to loop through them all. The $(ls xyz*) is using the wildcard * character to list all files that start with xyz and then have any characters afterwards.
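If you prefer to drive the loop from Python instead of the shell, something like this sketch works. The stand-in files here just mimic the names split produces; in the real loop you would hand each path to threaded_wpusers.py:

```python
import glob, os, tempfile

# Make a few empty stand-in chunk files like the ones split produces
workdir = tempfile.mkdtemp()
for name in ("xyzab", "xyzaa", "xyzac"):
    open(os.path.join(workdir, name), "w").close()

# sorted(glob.glob(...)) walks the chunks in the same order `ls xyz*`
# would; in the real loop each path would be fed to the script
chunks = [os.path.basename(p)
          for p in sorted(glob.glob(os.path.join(workdir, "xyz*")))]
print(chunks)
```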
And sadly, that is where I am going to have to leave this blog post. That’s right, it’s a cliff-hanger!
I have the rest of this adventure mostly done, in fact it’s 90% done, but there are some real-life things that have gotten in the way that I have to sort out first. I promise the next episode will be up on Monday.
So until then, I hope you have a great weekend!