Monday, April 20, 2020

Words Floating in the Universe


Like millions of people around the world, I too have been working from home these days because of the coronavirus pandemic, and now and then I read some news or a blog on the internet. A few days ago, seeing me read one such blog, my wife asked, "So, you don't write blogs anymore? And why would you; back then you must have written for some xyz of yours, why would you write for me? Who all were they?" she asked with a smile, just to pull my leg. I said, "Yes, you are absolutely right, there were many lady readers; how many of them should I name? It has been a long time since any xyz made a request; let's see when the order comes. Until then you will just have to wait." Saying this, I got back to my office work, and madam put a plate of snacks on my table and walked away.


One meeting after another, and then a third; that is how most of the time passes, meeting office colleagues over Google Hangouts. During one such meeting, I had muted my laptop's microphone and was listening to the meeting on my headphones when someone called and I started talking on the phone. In the meantime, I noticed a notification from Google Hangouts on my laptop which said something like: "Are you talking? Your microphone is muted." I thought, how smart computers have become; keep them on mute and they still keep listening to everything, and who knows, they might even be recording it all, whether they tell us or not.


A little later I hung up the phone and turned my attention back to the meeting, but thinking about laptops and mobile phones recording us all the time reminded me of my school-time Hindi teacher, Shri Damodar Lal Gupta ji. Along with teaching Hindi, Damodar Guruji used to tell us many stories from outside the syllabus, and one such story came back to me. It goes something like this.




   Image Source: https://images.app.goo.gl/uaUgGuEnTWEeomrY9

One day, while teaching the couplets of Surdas, Guruji said that everything we speak is made of sound waves, and it is through these waves that a voice travels from the speaker's mouth to the listeners' ears. This seems like a perfectly ordinary fact now, but some 21-22 years ago, when I was in the 10th class, it was a new idea, because we had probably never heard of sound waves until then. Guruji went on to say that all sound waves keep floating in the universe for eternity and never become extinct, and that scientists were developing a technology by which the voice waves floating in the universe for thousands and lakhs of years could be found again and recorded.

Once that becomes possible, all the mantras chanted by the rishis and sages thousands and lakhs of years ago could be recorded afresh. Even the discourse of the Gita that Lord Shri Krishna delivered at the time of the Mahabharata war would be found again and made available for everyone to hear. A boy asked, "Guruji, then Tulsidas's Ramcharitmanas and all of Surdas's couplets would also be found floating in the universe, exactly as they spoke them?" "It is entirely possible," Guruji said.


Then I asked, "Guruji, thousands of crores of people have lived on this earth and said who knows how many things by now; how will scientists get to the words that are actually of use to someone?" Before Guruji could answer, a girl said that if scientists can find and record the old words, they will also fill all of them into separate cassettes; go find whichever ones you need, alright! Then a boy sitting in the last row spoke up: "Everything else is fine, Guruji, but if only we could find out in which field our grandfather buried all his earnings, and from whom all we have to recover the money he lent." Hearing this, all the boys and girls of the class burst out laughing.


Anyway, Guruji said his piece and left, and the story has stayed in my mind ever since. Now, about 21 years later, it is quite clear that probably no scientist was ever working on any such technology, nor has any such technology been developed to date. But one thing is certain: even though WhatsApp did not exist back then, just as all kinds of true and false messages circulate on WhatsApp these days, tales and stories like this one have been entertaining and educating people for thousands of years.


On the other hand, even if Guruji's words did not come true in the matter of recording thousands-of-years-old talk floating in the universe, at almost the same time that he was telling us these things in class, a company named Google was being founded in this world. In the last 21-22 years it has not only recorded all the information available on crores of websites around the world, but has also dug out books thousands of years old and newspapers hundreds of years old and made them available to everyone, almost as if Damodar Lal ji's imaginary scientists had actually succeeded in recording the words floating in the universe.

Monday, November 14, 2016

Android Apps Statistics - Google Play Store - November 2016


I have been watching the Android apps ecosystem closely over the years, and the growth in this space has been tremendous. There are close to one million developer accounts that have published one or more apps on the Google Play Store.

As of Nov 9th, 2016, a total of 7,273,197 apps had been published on the Play Store. Of these, close to 3.48 million apps have been removed from the Play Store for reasons best known to Google, while approx. 900K apps show the "not available in your country" message to users in the USA or the South East Asia region.

The distribution of the 2,884,190 apps as per their number of downloads is shown below.

Downloads                     | Apps      | Group Total | Share
------------------------------|-----------|-------------|--------
Info Not Available            | 68,395    | 68,395      | 2.37%
1 - 5                         | 216,386   |             |
5 - 10                        | 113,642   |             |
10 - 50                       | 513,107   |             |
50 - 100                      | 252,987   |             |
100 - 500                     | 585,501   |             |
500 - 1,000                   | 214,647   | 1,896,270   | 65.75%
1,000 - 5,000                 | 384,859   |             |
5,000 - 10,000                | 126,189   | 511,048     | 17.72%
10,000 - 50,000               | 224,367   |             |
50,000 - 100,000              | 63,757    | 288,124     | 9.99%
100,000 - 500,000             | 81,889    |             |
500,000 - 1,000,000           | 16,090    | 97,979      | 3.40%
1,000,000 - 5,000,000         | 17,000    |             |
5,000,000 - 10,000,000        | 2,700     | 19,700      | 0.68%
10,000,000 - 50,000,000       | 2,250     |             |
50,000,000 - 100,000,000      | 230       |             |
100,000,000 - 500,000,000     | 159       |             |
500,000,000 - 1,000,000,000   | 14        |             |
1,000,000,000 - 5,000,000,000 | 21        | 194         | 0.01%
Total                         | 2,884,190 |             | 100.00%

(The Group Total and Share columns aggregate the block of bucket rows ending at that line. Note: 50M - 100M belongs to the 10M+ block totalling 2,480 apps, i.e. 0.09%.)


If an app has 100K - 500K downloads, it's in the top 5% of all the apps available for download.

Less than 1% of all the available apps have 1 million or more downloads.
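Both claims are easy to recompute from the table; a quick sketch, with the group totals hard-coded from the table above:

```python
# Group totals from the downloads table above.
total = 2_884_190
groups = {
    "100K+ downloads": 97_979 + 19_700 + 2_480 + 194,
    "1M+ downloads":   19_700 + 2_480 + 194,
}
for label, count in groups.items():
    print(f"{label}: {count:,} apps, the top {100 * count / total:.2f}% of the store")
```

That works out to roughly 4.2% for 100K+ and 0.78% for 1M+, which is where the "top 5%" and "less than 1%" figures come from.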

These 2,884,190 apps are listed in 48 categories.


Category                | No. of Apps
------------------------|------------
Education               | 235,928
Lifestyle               | 214,473
Business                | 206,168
Entertainment           | 205,565
Tools                   | 173,031
Personalization         | 159,923
Music & Audio           | 126,248
Books & Reference       | 123,221
Travel & Local          | 108,579
Puzzle                  | 94,944
Casual                  | 93,576
Productivity            | 91,338
Health & Fitness        | 89,626
Arcade                  | 87,624
Shopping                | 87,076
Social                  | 73,370
Communication           | 71,762
News & Magazines        | 71,367
Sports                  | 68,111
Finance                 | 60,109
Photography             | 48,840
Maps & Navigation       | 43,284
Medical                 | 40,868
Action                  | 39,061
Educational             | 35,223
Video Players & Editors | 30,487
Adventure               | 28,529
Simulation              | 25,080
Racing                  | 17,242
Trivia                  | 16,888
Food & Drink            | 14,442
Card                    | 11,631
Casino                  | 10,432
Strategy                | 10,403
Board                   | 10,158
Weather                 | 10,000
Libraries & Demo        | 8,850
Role Playing            | 8,379
Word                    | 7,968
Comics                  | 5,719
Auto & Vehicles         | 3,835
Music                   | 3,104
Art & Design            | 2,673
Events                  | 2,514
Beauty                  | 2,140
House & Home            | 2,076
Dating                  | 1,034
Parenting               | 820
Info not available      | 471
Total                   | 2,884,190

Leave a comment if you want to know more ... :)

Thursday, September 17, 2015

Android Apps Statistics - Google Play Store - Sept 2015

With a special interest in mobile apps, I have been tracking apps on Google Play for a year now.

I've tracked close to 1.7 million apps on the Google Play Store. As of 14th Sept 2015, there were 17,00,835 apps on the Play Store which my crawler could find.

There may be some more newly launched apps which don't show up in Search or the Similar Apps section on Google Play. These are hard to find, as the crawler couldn't trace them easily. So the total number of apps launched on Google Play may be around 1.75 million.

1.42 million apps are live now, while 280K apps have been removed from the Play Store by Google for some reason!

82% of the live Apps have less than 10,000 downloads.

About 1% of total apps have more than 1 million downloads! 

This is where everyone wants to be :)



As per AppBrain (17th Sept 2015), there are close to 1.7 million apps on the Play Store, so I'm sure my estimates are pretty accurate!!


The AppBrain guys are still counting the dead apps, though, as only 1.42 million apps are live and downloadable!

Comment below if you want to know more about these stats or need to compare your apps with others in their category or in the Store.

Wednesday, January 21, 2015

Finding Duplicate Voters of Delhi


The news flashed about a political party claiming that there are duplicate voters in the voter lists of Delhi. They even showed a few samples on their Facebook and Twitter accounts.

Is it really the case? I wondered. It certainly can be! Just about every organization struggles to maintain a single version of a person's identity, be it customers, leads or even its own employees. Even the largest online social networking platforms have duplicate and fake profiles. So there is no surprise if the voter lists too have a certain number of duplicate or fake voters.

After all, these lists are updated by thousands of government employees, who keep adding, deleting and moving voters from one list to another; even the technology used to store the data keeps changing, resulting in duplicate and erroneous entries.

Being a professional in the analytics industry, I regularly deal with data sets that contain numerous duplicate or erroneous entries, which need to be removed before meaningful reports can be generated. So I decided to try my hand at finding the duplicate voters in Delhi's voter lists.

The task was challenging and interesting, and it had to be done! Just imagine finding duplicate voters scattered across 11,763 voter lists spanning more than 4,00,000 pages. It certainly cannot be done manually, even though the volunteers of a political party tried to do just that, I read.

I used my technology expertise and proceeded with the task. The steps I took are described below -

1. I downloaded all the PDF files (11,763 in total) from the website of the Election Commission, using a web-crawler I had written some time back.


Pdf file format (Illustrative)
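I don't have that crawler's code anymore, but the download step can be sketched in a few lines of Python; the URL pattern below is made up for illustration, the real Election Commission URLs were different:

```python
import os
import urllib.request

# Hypothetical voter-list PDF URLs; the real crawler discovered 11,763 of them.
pdf_urls = [
    "http://example.org/voterlists/AC001/part_001.pdf",
    "http://example.org/voterlists/AC001/part_002.pdf",
]

os.makedirs("pdfs", exist_ok=True)
for url in pdf_urls:
    # Keep the constituency folder in the file name to avoid collisions.
    filename = os.path.join("pdfs", "_".join(url.split("/")[-2:]))
    if not os.path.exists(filename):  # resume-friendly: skip completed files
        urllib.request.urlretrieve(url, filename)
```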


2. I converted the PDF files into text/HTML files using open-source PDF libraries, e.g. xpdf and pdfminer.

Extracted text file from a pdf file (Illustrative)
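The conversion can be scripted around xpdf's pdftotext utility; a minimal sketch (pdfminer offers a pure-Python alternative):

```python
import glob
import subprocess

# Run pdftotext on every downloaded file; -layout keeps the column
# positions, which matters for the parsing step that comes next.
for pdf in glob.glob("pdfs/*.pdf"):
    subprocess.run(["pdftotext", "-layout", pdf, pdf[:-4] + ".txt"], check=True)
```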


3. I parsed the text files to extract the voter information into a flat-file format, e.g. CSV or TXT files.


Parsing a text file to columnar format (Illustrative)

The third part was the most challenging. It took about a week's worth of late-night coding to figure out a way to extract the data and arrange it in a columnar format so that it could be pushed into databases like MySQL.

Many improvements had to be made along the way to handle the errors present in the PDF files.
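The actual lists were far messier than any single pattern, but the essence of this step was regex work along these lines (the record format and field names here are illustrative, not the real layout):

```python
import csv
import re

# Illustrative record: "123  KUMAR SHARMA  M  34  XYZ0001234"
record = re.compile(r"^\s*(\d+)\s+(.+?)\s+([MF])\s+(\d{1,3})\s+([A-Z]{3}\d{7})\s*$")

with open("pdfs/AC001_part_001.txt") as fin, \
     open("voters.csv", "w", newline="") as fout:
    writer = csv.writer(fout)
    writer.writerow(["serial", "name", "sex", "age", "epic_no"])
    for line in fin:
        m = record.match(line)
        if m:  # lines that don't match are headers, footers or noise
            writer.writerow(m.groups())
```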

The result of the task was amazing. I got the details required to find the probable duplicate voters of the NCT of Delhi.


Distribution of number of voters by Age - Delhi


The data in columnar format enabled me to find duplicate voters across constituencies. The voters who had moved house and obtained new voter ID cards, but had not surrendered their old cards, were filtered out into an Excel file.
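At its core, the duplicate hunt is a grouping query. A simplified sketch of the idea, shown here with Python's built-in sqlite3 so it stays self-contained (the table and column names are illustrative; the real matching also had to tolerate spelling variations):

```python
import sqlite3

conn = sqlite3.connect("voters.db")
# Any (name, relation, age) combination that shows up in more than one
# constituency is a probable duplicate worth manual review.
rows = conn.execute("""
    SELECT name, relation_name, age, COUNT(DISTINCT constituency) AS n
    FROM voters
    GROUP BY name, relation_name, age
    HAVING n > 1
""").fetchall()
for name, relation, age, n in rows:
    print(f"{name} ({relation}, age {age}) appears in {n} constituencies")
```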


The results were shared with the political party that was interested in finding and removing duplicate voters to strengthen the democratic process in India. This whole exercise helped them make a strong case with the Election Commission to scrutinize the voter lists and remove the duplicate voters as far as possible!!


Acknowledgement :)

Now that the elections in Delhi are due on 7th Feb '15, I hope that the best candidates get elected and people get a stable, inclusive and progressive government.





Sunday, January 18, 2015

Fascination with POS systems !


Every time I head into an Apollo pharmacy to buy medicines or baby stuff for my daughter, I make it a point to check their billing desktop, trying to see exactly what is on the screen as the salesperson asks me for my mobile number to credit the points earned on the purchased medicines/items to my account. I have even tried to inquire about the software they use for billing and whether the same system is used across the thousands of Apollo pharmacies in India.

The fascination with knowing the software used at retail pharmacies grew a bit further, and I asked at Guardian, 98.4 and a few other pharmacies in Sector 30-31 and Sector 49 of Gurgaon too. There was always a new question for the retailer whenever I saw a pharmacy using software to manage its sales transactions. They answered as well as they could; in some cases they just said that this was all they knew and they didn't have any further information. However, my curiosity kept growing, and I started asking how many products they stock in their pharmacy, and how they input the medicine names, prices and other details into their system. I came to know that Apollo has a centralized database of medicines and other utilities, and every pharmacy outlet just selects the items available with them and updates the quantity on hand. They also extract information about sales amounts across categories like medicines, utilities and returns on a daily or weekly basis.

I realized the strength of the system when I needed the bills of all the medicines purchased in the last one year, so that I could submit them for the tax rebate on medical bills of up to Rs. 15,000 which my company allows. Even though most pharmacies do provide a slip of the items purchased, I didn't have them all with me. So I simply inquired at the Apollo Pharmacy whether I could get the bills from my Apollo account. The helpful person at the outlet asked me for my mobile number and told me to come back in the evening to collect the bills. I went in the evening and got a big roll of paper bundled with a rubber band. I was surprised to find that I had purchased medicines worth over Rs. 10,000 in less than a year.

Thanks to the software system at the Apollo pharmacy, I could get all the bills for my office use.

This experience got me interested in the desktops kept at every sort of merchant outlet, be it the local grocery shop or easyday outlets. I made it a point to see which software/hardware systems were being used at the sales counter. The machine used by our neighbourhood shop has a tough name; it had "bizerba" written on it. On a rather free afternoon I asked about the price of the machine, whether it was connected to the PC kept beside it, and whether the whole system was connected to the internet. The boy at the sales terminal told me that the machine was a second-hand piece and had cost over a lakh rupees. The machine is not connected to the PC, and they manually enter the daily sales, item-wise, from a slip which the machine prints. I thought: wouldn't it be nice if the machine were directly connected to the PC?

Next I checked the easyday market sales/billing terminals. They have a "wincor-nixdorf" POS system. The name was tough, and I could not remember it despite the many times I saw it. When I finally remembered it, I searched and read about it. This interest has led me to believe that POS systems have more potential than just billing items and printing a receipt for the customer. The internet dimension of these POS systems can really change how customers interact with merchants, business owners and the people on the shop floor and at the billing terminal! There will be a day when I won't have to go to the outlet to collect my bills... I'll have them on my mobile with a tap on the screen.

Saturday, November 22, 2014

Making decisions for your Heart - Let Google/Facebook do it !

It's very natural for some people to develop a liking for a person after talking for a few days, and the mind can wander a lot in all directions, exploring numerous imaginative possibilities.

However, before your mind lets you go your heart's way... ask Google/Facebook, Airtel/Vodafone etc.... about the real situation...

Shown below is an analysis of my own chat history during 2011-2012 (bachelorhood days :D), which I did around two years back...

Number of lines of chat on a particular day
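The counting itself is trivial once you have the chat archive; a toy sketch, assuming an export where every line begins with a YYYY-MM-DD date (the file name and format are made up):

```python
from collections import Counter

lines_per_day = Counter()
with open("chat_export.txt") as f:  # hypothetical export file
    for line in f:
        day = line[:10]  # the leading "YYYY-MM-DD", if present
        if len(day) == 10 and day[4] == day[7] == "-":
            lines_per_day[day] += 1

for day, count in sorted(lines_per_day.items()):
    print(day, count)
```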

So what did Google say? Which one was the promising case? And what did my mind say? I leave that to your imagination... :)

How about an app to help you out ?

Monday, November 10, 2014

Designing a simple Web Crawler


A few months back, I took a one-month trial of a virtual machine on Windows Azure to run an experimental project on web-crawling. The experiment was a good learning experience; more than anything, it helped me appreciate the technology infrastructure that Google, Yahoo, MSN etc. must have. I couldn't crawl even a single website completely, while they continue to crawl billions of web pages regularly. In this post, I'll describe how I went about crawling more than 100 million public webpages on a machine with a Core 2 Duo processor and 3.5 GB RAM, for as long as the trial period lasted.

I used a popular C library called libcurl to fetch the web pages, and wrote a wrapper in C++ to parse the HTML pages.

Just like any web-crawler, the process started with a list of seed URLs, each of which may or may not yield new URLs. As you might have seen, most websites have many pages with links to many other web pages, either on the same domain or on other websites. The logic of crawling is simple: fetch a webpage, look for new URLs in its HTML (source code), and add the new URLs to the ever-growing list of pages to fetch.
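My version was C++ on top of libcurl, but the loop itself is language-agnostic; here it is sketched in Python (the seed URL is a placeholder, and the regex is a naive stand-in for a proper HTML parser):

```python
import re
from collections import deque
from urllib.request import urlopen

seeds = ["http://example.org/"]  # placeholder seed URL
queue, seen = deque(seeds), set(seeds)

while queue:
    url = queue.popleft()
    try:
        html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
    except OSError:
        continue  # dead link; move on
    for link in re.findall(r'href="(https?://[^"]+)"', html):
        if link not in seen:  # only queue URLs we haven't seen before
            seen.add(link)
            queue.append(link)
```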

I started the crawler with a seed URL list of a few public web-profiles of a popular professional networking site. As every page has links to close to a dozen other profiles, it wasn't a problem to find new profile pages to add to the URL list.

The crawler started with calls that fetched one webpage at a time. Every request took around 5-10 seconds to completely fetch the webpage and parse its content, and I was able to fetch close to 600 profiles in the first hour. However, it was very frustrating to see such a slow rate of crawling. After a few hours of searching, I came to understand libcurl's multiple-request interface, which means you can fetch up to 1000 pages in a single call. This could increase the crawling rate very significantly and make the whole exercise meaningful.

As it was the first time I was using libcurl, it took a while to explore its features, so I spent a weekend understanding it and embedding its functionality in my code.
With the new crawler, I started fetching 500 URLs in one go. As anyone would expect, the web requests got blocked by the target website within 2-3 minutes after a few attempts; the multiple requests themselves worked, though, and pages were being fetched successfully until then. Now the task was to run the crawler without getting blocked.
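libcurl does this through its multi interface; the same effect can be sketched in Python with a thread pool (the batch of profile URLs below is hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

def fetch(url):
    try:
        return url, urlopen(url, timeout=10).read()
    except OSError:
        return url, None  # treat failures as missing pages

# Fetch a whole batch concurrently instead of one page at a time.
batch = [f"http://example.org/profile/{i}" for i in range(500)]
with ThreadPoolExecutor(max_workers=50) as pool:
    for url, body in pool.map(fetch, batch):
        if body is not None:
            pass  # parse body, extract new URLs, extend the queue
```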

There are two ways to do this. The first is to hit the target website at a very low rate, i.e. just a few hundred pages per day, so that it doesn't take you as a threat and keeps responding to your requests. The other is to send your requests so that they appear to come from disparate sources with significant delays, so that the target website isn't alarmed by an automated robot.

I chose the second option. I browsed many websites that publish proxy server IPs to build a list of around 3,000 proxies, and added a function in the code to continuously check their status; 15-20% of these proxies were working at any given time. The proxies that worked on my target website were used to fetch the webpages; the ones that had gone inactive were removed and added back at the end of the queue. In this fashion, the crawler was able to run continuously.
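A rough sketch of that proxy rotation, with placeholder addresses:

```python
from collections import deque
from urllib.request import ProxyHandler, build_opener

proxies = deque(["203.0.113.5:8080", "198.51.100.7:3128"])  # placeholder IPs

def fetch_via_proxies(url):
    """Try proxies in turn, pushing each one to the back of the queue."""
    for _ in range(len(proxies)):
        opener = build_opener(ProxyHandler({"http": "http://" + proxies[0]}))
        proxies.rotate(-1)  # next call starts from a different proxy
        try:
            return opener.open(url, timeout=10).read()
        except OSError:
            continue  # proxy dead or blocked; try the next one
    return None
```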

There were a few occasions when the application would crash suddenly. After many debugging hours, I came to realize that the crash was happening inside a libcurl function call, over which I didn't have much control since I was using it straight from the library. This was quite troubling, as I had to restart the application every time it crashed, which meant I couldn't leave it running unattended. So I found a solution: monitor the running status of the application through WMIC, and relaunch it if it isn't running. That allowed me to sleep peacefully.
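Something in the spirit of the watchdog below, where the process name and launch path are placeholders:

```python
import subprocess
import time

# Relaunch the crawler whenever WMIC no longer lists its process.
while True:
    out = subprocess.run(
        ["wmic", "process", "where", "name='crawler.exe'", "get", "processid"],
        capture_output=True, text=True).stdout
    if not any(tok.strip().isdigit() for tok in out.splitlines()):
        subprocess.Popen([r"C:\crawler\crawler.exe"])  # placeholder path
    time.sleep(60)  # check once a minute
```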

The next challenge came in handling the lists of URLs that had already been fetched and the ones still queued to be fetched. Every new URL had to be checked against both: had it already been fetched, and was it already present in the queue? When the list exceeded around 15 million URLs, it became clear that 3.5 GB of RAM could not hold these lists, yet the check was necessary to avoid fetching the same page twice. I also had to track how many pages each URL had been found on, giving it a rank: the more occurrences of a URL, the better its rank and the higher its priority in the queue.



After a few days of continuous crawling, it became necessary to use a regular database to handle the URL queues. I chose SQLite because of its easy interface with C++. The process continued for over two weeks, filling close to 2 TB of the hard drives with the fetched web pages and the parsed content in the SQLite database. The HTML parser was also written in C++; I tried doing it in Python, but it was relatively slower than C++, so I stayed with C++.
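A minimal sketch of such a SQLite-backed URL frontier (the schema is illustrative, not my original one):

```python
import sqlite3

conn = sqlite3.connect("frontier.db")
conn.execute("""CREATE TABLE IF NOT EXISTS urls (
    url     TEXT PRIMARY KEY,   -- dedup happens here, not in RAM
    fetched INTEGER DEFAULT 0,  -- 0 = queued, 1 = done
    rank    INTEGER DEFAULT 1   -- number of pages that linked to this URL
)""")

def add_url(url):
    # First sighting inserts the URL; repeat sightings bump its rank.
    conn.execute("""INSERT INTO urls(url) VALUES (?)
                    ON CONFLICT(url) DO UPDATE SET rank = rank + 1""", (url,))
    conn.commit()

def next_url():
    row = conn.execute("""SELECT url FROM urls WHERE fetched = 0
                          ORDER BY rank DESC LIMIT 1""").fetchone()
    return row[0] if row else None
```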

By now the trial period was about to end, as just 3 days' worth of money was left in the trial account. I thought of taking the parsed content out of the virtual machine and storing it in some safe place where I could access it later. However, there was no way I could store close to 2 TB of HTML content anywhere free of cost.

I thought of uploading a few GBs of the meaningful data to an online drive. So I dragged the data file to the online drive and went to sleep. The next morning, what I saw was unbelievable! The trial was over and only a few GBs of data had reached the online drive. I could not even log into the virtual machine, as the trial had already ended for lack of credit in the account.

The cause was that while downloading data into the virtual machine was very inexpensive, uploading data out of the machine was quite expensive. It sucked out all the remaining hundreds of rupees.

The trial account was closed abruptly, leaving all the parsed data and SQLite database files on the remote cloud server!



However, to me it was an amazing learning experience, doing a grain of what Google and the rest do on a daily basis.


Thursday, June 19, 2014

Birthday Distribution - How common is your birthday?

The date of birth marked on the high school mark sheet is treated as the official birth date in India. So I tried to see how the dates of birth on the mark sheets of Rajasthan Board students are distributed across the days of the year.

I used around 3 million students' DOBs for the analysis.


To me, at first look, the numbers seem quite skewed, as July alone accounts for over 21% of the births.


1st July, the day when the government schools open after summer holidays, seems to be the most popular birthday!

The 1st, 5th, 10th, 15th, 20th and 25th days of a month seem to be the most popular choices for a day to be born on!! I leave it to you to decide how far to trust the official dates of birth.

Date of Birth Distribution - Heat Map
*I removed 13 records where the dates of birth were inconsistent, such as 31st June.
**An Excel sheet got corrupted, taking away the data of around 6,00,000 students of year 2012 :( :(
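For anyone who wants to reproduce this kind of distribution, a minimal sketch, assuming a headerless CSV with one DD/MM/YYYY date of birth per row (the file name and format are assumptions):

```python
import csv
from collections import Counter

births = Counter()
with open("dobs.csv") as f:        # hypothetical input file
    for (dob,) in csv.reader(f):   # e.g. "01/07/1998"
        day, month, _ = dob.split("/")
        births[(month, day)] += 1

total = sum(births.values())
for (month, day), n in births.most_common(10):
    print(f"{day}/{month}: {n} students ({100 * n / total:.2f}%)")
```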

Tuesday, May 13, 2014

Strange Pattern in the Rajasthan Board Results

Rajasthan Board does it again...

Rajasthan Board declared the results of the Senior Secondary board exams for the Commerce and Science streams.

None of the students who passed the board exams got 44.8% or 59.8% marks this year!!!

 
They jump the normal distribution curve to push the students into the First and Second Divisions :) 44.8% and 59.8% sit just below the 45% and 60% division cut-offs, so students landing there appear to have been nudged over the line.


The same strange phenomenon happened in the results of the year 2013!!
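Spotting such gaps is straightforward once the marks are in hand; a sketch, with a tiny made-up marks list standing in for the real result data:

```python
from collections import Counter

# Hypothetical aggregate percentages, one per passing student.
marks = [44.6, 45.0, 45.2, 59.6, 60.0, 61.4]

counts = Counter(round(m, 1) for m in marks)
# Check the values just below the 45% and 60% division cut-offs.
for cutoff in (45.0, 60.0):
    for v in (round(cutoff - 0.2, 1), round(cutoff - 0.1, 1)):
        if counts[v] == 0:
            print(f"no student scored exactly {v}% (just below {cutoff}%)")
```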



 



Results of Year 2013:

http://justanytime.blogspot.in/2013/10/securing-first-division-by-jumping-bell.html

http://justanytime.blogspot.in/2013/08/get-sessional-marks-for-free-in.html