I started learning web scraping about a month ago. Now I have some data stored as CSV file a.csv, containing 9 columns and about 11 million rows in total, including all the urls I want to extract data from. For some of the urls I have visited and got data from them, stored as another CSV file b.csv, which contains 2 column and about 4.5 million rows in total. For some reason like 403 error, 404 error, there are still about 6.4 million urls I want to visit, but there are buried in the huge a.csv file, now I have a problem, how to remove the rows from a.csv which also contains the 2 columns in b.csv?
My first though was to iterate through all lines in a.csv and check if each existed in b.csv, but it took about 1 or 2 minutes to process just one line, which was obviously not desirable.
Then I tried to import all lines into MySQL and got interested records using “NOT EXISTS” syntax, but it ran overnight and was still running this morning, and I didn’t even know the progress or how much longer it would need to finish this query.
After I woke up this morning, an idea came to my mind: hashed search is much faster as I read somewhere. Sorry I didn’t graduate from a CS major so I don’t know if it is the correct terminology.
Then I read through b.csv and construct a tuple with the 2 values from each row, and use this tuple as a key in a dictionary called target. Then iterate through all lines in a.csv and construct another tuple with the 2 values from interested columns (out of 9 columns) in each row, and then check if it exists as a key in the target dictionary. If it does not, write the row to a new file. It went pretty well.
This might be just a trivial problem, or not even a problem for a CS major, but it bothered me a whole day yesterday. Neither Google nor SO helped me much to the right path, or perhaps I didn’t use the right keywords, ’cause I wasn’t quite sure how to ask this question. I guess I need to learn more about data structure or algorithm from now on.