On the shoulders of giants

This expression became famous with Newton in the 1600s, but it was already in use as early as the 1100s¹. Here at Bits4Waves we usually don’t immediately dismiss ideas that have lingered for 1000 years or so; we try to learn from them, if possible! That’s why today’s activity is so gratifying…

We’ve been collecting the shortcodes of the posts with the hashtag #100daysofpractice. There are 600k in total, but we could get only 50k (12 times fewer!).

We obtained them with the Python library instaloader, but the download kept breaking at the 50k mark.

After I shared the issue on instaloader’s GitHub, one of the developers was kind enough to help. Applying some advanced wizardry, he cooked up a new script using ideas and code from related open issues and SHAZAM: we now have 250k shortcodes! It still breaks at this point, and I have reported that as well. Let’s hope it’s solvable!

Meanwhile, we have work to do:

  • [X] update the code with the new script

As for the code, I had created a new branch for the new script and given the script a different name. Since it works better than the previous version, it can simply replace it. Let’s do this:

SRC=~/sci/100daysofpractice-dataset/src
pushd "$SRC"

# Drop the old script, then give the new script its name.
git -C "$SRC" rm get-shortcodes.py
git -C "$SRC" mv get-hashtag.py get-shortcodes.py

This takes care of the renaming. Now we have to check whether everything still works with the new script. Let’s start from the beginning: the Makefile.

ifndef IG_USER
$(error IG_USER is not set)
endif

PYTHON=python
SRC=../src
SHORTCODES_ORIG=shortcodes-orig.txt
SHORTCODES_TEST=shortcodes-test.txt
SHORTCODES_SORT=shortcodes-sort.txt
SHORTCODES_UNIQ=shortcodes-uniq.txt

all: shortcodes-orig shortcodes-test shortcodes-sort shortcodes-uniq

shortcodes-orig:
	$(PYTHON) $(SRC)/get-shortcodes.py

shortcodes-test: $(SHORTCODES_ORIG)
	head --lines=10 $(SHORTCODES_ORIG) > $(SHORTCODES_TEST)

shortcodes-sort: $(SHORTCODES_ORIG)
	sort $(SHORTCODES_ORIG) > $(SHORTCODES_SORT)

shortcodes-uniq: $(SHORTCODES_SORT)
	uniq $(SHORTCODES_SORT) > $(SHORTCODES_UNIQ)

clean:
	rm -rf $(SHORTCODES_ORIG) $(SHORTCODES_TEST) $(SHORTCODES_SORT) $(SHORTCODES_UNIQ)

First, let’s fix some issues with the Makefile:

  • [X] a fundamental problem with the Makefile: the targets must be the actual file names, extension included!
  • [X] fix: typos in targets’ names
  • [X] create a link for the final file at the end
  • [X] add variable for link to final file

ifndef IG_USER
$(error IG_USER is not set)
endif

PYTHON=python
SRC=../src
GET_SHORTCODES_PY=$(SRC)/get-shortcodes.py
SHORTCODES_ORIG=shortcodes-orig.txt
SHORTCODES_TEST=shortcodes-test.txt
SHORTCODES_SORT=shortcodes-sort.txt
SHORTCODES_UNIQ=shortcodes-uniq.txt
SHORTCODES_LINK=shortcodes.txt
OBJECTS = $(SHORTCODES_ORIG) $(SHORTCODES_TEST) $(SHORTCODES_SORT) $(SHORTCODES_UNIQ)

.PHONY: all clean

all: $(OBJECTS)

$(SHORTCODES_ORIG): $(GET_SHORTCODES_PY)
	$(PYTHON) $(GET_SHORTCODES_PY)

$(SHORTCODES_TEST): $(SHORTCODES_ORIG)
	head --lines=10 $(SHORTCODES_ORIG) > $(SHORTCODES_TEST)

$(SHORTCODES_SORT): $(SHORTCODES_ORIG)
	sort $(SHORTCODES_ORIG) > $(SHORTCODES_SORT)

$(SHORTCODES_UNIQ): $(SHORTCODES_SORT)
	uniq $(SHORTCODES_SORT) > $(SHORTCODES_UNIQ)
	ln --symbolic --force $(SHORTCODES_UNIQ) $(SHORTCODES_LINK)

clean:
	rm -rf $(OBJECTS) $(SHORTCODES_LINK)

Much better!
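
Why does the first item in the checklist matter so much? Roughly speaking, make treats a target as a file and compares its timestamp with those of its prerequisites. The sketch below is a simplified model of that check, not make’s real algorithm; the function name needs_rebuild and the demo directory are made up for illustration.

```shell
# Simplified model of make's freshness check (illustration only).
# Returns 0 (rebuild needed) if the target file is missing
# or is older than any of its prerequisites; returns 1 otherwise.
needs_rebuild() {
    target=$1
    shift
    [ -e "$target" ] || return 0                  # no such file: always rebuild
    for prereq in "$@"; do
        [ "$prereq" -nt "$target" ] && return 0   # prerequisite is newer
    done
    return 1                                      # target is up to date
}

# Demo with fixed timestamps: the script is one day older than its output.
demo_dir=$(mktemp -d)
touch -t 202001010000 "$demo_dir/get-shortcodes.py"
touch -t 202001020000 "$demo_dir/shortcodes-orig.txt"

# Old-style target name: a file called "shortcodes-orig" never exists.
if needs_rebuild "$demo_dir/shortcodes-orig" "$demo_dir/get-shortcodes.py"; then
    echo "shortcodes-orig: rebuild"
fi

# File-name target: the output is newer than the script, nothing to do.
if ! needs_rebuild "$demo_dir/shortcodes-orig.txt" "$demo_dir/get-shortcodes.py"; then
    echo "shortcodes-orig.txt: up to date"
fi

rm -rf "$demo_dir"
```

With the extension-less names, the target file is never found, so the download step would run again on every invocation. With the real file names, it runs only when the script changes.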

Now, it would be nice to

  • [X] unify the old and new shortcodes into a single file

Done!
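
For the record, unifying two shortcode lists can be as simple as a sorted, de-duplicated merge. This is a minimal sketch, not necessarily the exact command used here; the function name merge_shortcodes, the file names and the shortcodes themselves are placeholders.

```shell
# merge_shortcodes OLD NEW
# Print the union of both lists, sorted, one shortcode per line.
# LC_ALL=C makes the ordering byte-wise and therefore reproducible.
merge_shortcodes() {
    LC_ALL=C sort -u "$1" "$2"
}

# Demo with made-up shortcodes; two of them appear in both files.
demo_dir=$(mktemp -d)
printf '%s\n' B0aaa B0ccc B0bbb > "$demo_dir/shortcodes-old.txt"
printf '%s\n' B0ccc B0ddd B0aaa > "$demo_dir/shortcodes-new.txt"

merge_shortcodes "$demo_dir/shortcodes-old.txt" "$demo_dir/shortcodes-new.txt" \
    > "$demo_dir/shortcodes-all.txt"

cat "$demo_dir/shortcodes-all.txt"   # four unique shortcodes, sorted
rm -rf "$demo_dir"
```

The output goes to a third file on purpose: redirecting into one of the input files would truncate it before sort gets to read it.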

Finally, let’s make use of all the wizardry we got access to, and try to continue downloading from 250k onwards.

We’ll manually change the session file to make total_index point to 250k. OK, that’s done! Now let’s run make and wait for the results!
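
The manual session-file edit could also be scripted. The sketch below assumes the session file is plain, uncompressed JSON text in which "total_index" is followed by an integer; that layout, the function name set_total_index and the demo file are assumptions for illustration, so check the real file before trying anything like this on it.

```shell
# set_total_index FILE N
# Print FILE with the integer after "total_index" replaced by N.
# Assumes plain JSON text; everything else is passed through unchanged.
set_total_index() {
    sed -E "s/\"total_index\"[[:space:]]*:[[:space:]]*[0-9]+/\"total_index\": $2/" "$1"
}

# Demo on a hypothetical, heavily simplified session file.
demo_dir=$(mktemp -d)
printf '%s\n' '{"total_index": 50000, "other": "unchanged"}' > "$demo_dir/session.json"

set_total_index "$demo_dir/session.json" 250000 > "$demo_dir/session-250k.json"

cat "$demo_dir/session-250k.json"
rm -rf "$demo_dir"
```

Writing the result to a new file keeps the original session around in case the resumed download does not accept the edited one.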

See ya!

Published by eglur

I have a B.Sc. in Computer Science and an M.Sc. in Computer Engineering, both from the University of São Paulo, and have been programming for 16 years.
