Overview:
There are millions of websites, and most of the data they offer can only be viewed in a web browser. Most sites do not provide any mechanism to save the data we are interested in to a local machine. Web scraping is the process of extracting data from a website based on a given requirement.
Automation engineers know how to automate the process of launching a website and extracting certain information from it based on a given requirement. However, automating a browser is not a robust way of extracting data!
There are commercial tools for this purpose. In this short article, I would like to show you how we can use JMeter, for free, for web scraping.
Sample Requirement:
Let's assume that we need to collect the prices of given stocks from the Google Finance website and write them to a text file every minute.
Steps to follow:
- Launch a browser
- Go to www.google.com/finance
- Enter the stock name
- Search
- Get the price
- Write it in a file
- Repeat the above steps for other stocks
- Repeat the above steps every minute
Some people use Selenium for this requirement because it is one of the browser automation tools.
But we are going to use JMeter, for the following reasons.
Why JMeter:
- JMeter is free and open source.
- JMeter makes HTTP requests like a browser and gets the data.
- JMeter is not a browser, so it does not have to launch one.
- JMeter is lightweight compared to web browsers.
- JMeter can skip all the static file requests (images, CSS, JS, etc.).
- JMeter is very quick in getting the responses (because it does not have to render the HTML page as the browser does).
- JMeter has a powerful Regular Expression Extractor which gets the specific text we are interested in.
- JMeter has many built-in elements and plugins, for example the CSV Data Set Config for reading data from a CSV file. Browsers and Selenium do not have that option out of the box.
- JMeter is multi-threaded: we can spin up hundreds of threads easily, so it can process all these requests in parallel. Automating 100 browsers on one machine is not an easy task!
Let's start with the JMeter script for the above requirement.
Scripting In JMeter:
- First, I create a data.csv file with the list of the stocks to be monitored as shown below.
- The JMeter test plan will have these 2 variables: jmeter.test.home holds the current directory (with a trailing file separator), and result starts as an empty string.
jmeter.test.home=${__BeanShell(import org.apache.jmeter.services.FileServer; FileServer.getFileServer().getBaseDir();)}${__BeanShell(File.separator,)}
result=
- Add a Loop Controller with the loop count set to the number of stocks; in our case, it is 4. [Instead of hard coding, we can get this programmatically.] The Loop Controller repeats the process for all stocks in the CSV file.
- Add a CSV Data Set Config under the Loop Controller to read the CSV file and feed the data to the JMeter script. I will use the variable 'stock' in the Variable Names field.
- Add an HTTP Sampler under the Loop Controller and update the server name, path and request details as shown here. This HTTP Sampler will simulate the below browser request.
- Add a Regular Expression Extractor under the HTTP Sampler to extract the specific data from the HTTP response. For example, the data we are interested in appears in the HTTP response as shown here, so the Regular Expression Extractor should be configured to fetch it and store it in the variable reg.price, which the next step reads.
- Add a BeanShell PostProcessor / JSR223 PostProcessor with the code below.
result = vars.get("result") + vars.get("reg.price").replace(",", "") + ",";
log.info(result);
vars.put("result", result);
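To check the extraction logic outside JMeter, here is a minimal plain-Java sketch of the two steps above: pull a price out of a response with a regular expression, strip the thousands separator, and append it to a comma-separated row. The HTML fragment, the pattern and the class and method names are all assumptions made up for illustration; the real Google Finance markup is not reproduced in this article and will differ.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: the markup and the pattern are hypothetical, not Google Finance's real HTML.
class PriceExtractSketch {

    // Assumed pattern: a <span class="price"> element holding digits, commas and an optional decimal part.
    private static final Pattern PRICE =
            Pattern.compile("<span class=\"price\">([0-9][0-9,]*(?:\\.[0-9]+)?)</span>");

    // Returns the first captured price, or null when nothing matches.
    static String extractPrice(String html) {
        Matcher m = PRICE.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    // Mirrors the post processor: drop thousands separators and append "price," to the row so far.
    static String appendPrice(String row, String rawPrice) {
        return row + rawPrice.replace(",", "") + ",";
    }

    public static void main(String[] args) {
        String html = "<div><span class=\"price\">1,234.56</span></div>"; // made-up response body
        String row = appendPrice("", extractPrice(html));
        row = appendPrice(row, "98.10");
        System.out.println(row); // prints 1234.56,98.10,
    }
}
```

Note that extractPrice returns null when the pattern does not match; the post processor above would fail in the same situation, so it may be worth guarding against a missing reg.price value.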
- Add a JSR223 Sampler with the code below to write the result to a CSV file. Add a timer under this sampler with a delay of 60 seconds.
def file = new File(vars.get("jmeter.test.home") + "result.csv");
file << vars.get("result") + "\n";
vars.put("result","");
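The append step can also be tried on its own. The following is a rough plain-Java equivalent of the Groovy snippet above, assuming each pass through the stock list produces one comma-separated row; the class name, method name and temporary file are hypothetical and only serve the demo.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Sketch only: names are illustrative and not part of JMeter.
class ResultFileSketch {

    // Appends one line to the file, creating the file if it does not exist yet.
    static void appendRow(Path file, String row) throws IOException {
        Files.write(file, (row + "\n").getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("result", ".csv"); // temp file so the demo leaves nothing behind
        appendRow(file, "1234.56,98.10,");  // first minute (made-up values)
        appendRow(file, "1235.00,98.25,");  // second minute (made-up values)
        System.out.println(Files.readAllLines(file));
        Files.delete(file);
    }
}
```

Opening the file in append mode on every write is what lets each one-minute iteration add a new row without overwriting the earlier ones.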
That is it! Run the JMeter test for a specific duration (set it in the Thread Group settings). JMeter will produce a result like this!
Instead of writing this data to a CSV file, we could easily write it to InfluxDB (as shown in this article) and create a Grafana chart as shown here.
Happy scraping and Subscribe 🙂