gender bias on hospitality networks: a case study
some gaita sanabresa. today this post is shaped like a scientific article, well, because it is one. if you want to see the results first, check the link in the abstract. apparently, CS is not that misogynous after all.
abstract
much has been said about gender bias in hospitality networks, and most of it has been speculation. it is frequently said that males on these networks exploit them as a way to meet females. i therefore present a quantitative analysis of a specific case study: the rate of replies to males and females in last minute groups on the website CouchSurfing (CS). the final charts can be seen on this page
methodology
all data was collected from publicly available pages, much like a search engine would do it. where users chose to make their data private, or groups chose to require a login, no data was collected.
the data was mined with a script written mainly in PHP. it was then dumped into a TSV file containing the timestamp of collection, the number of replies, the gender of the person sending the message, and the country and city of the group. group id, message id and user id were also collected for consistency.
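to make the file layout concrete, here is a minimal sketch of how one mined message could be written as a TSV row. it is in Python rather than the PHP used for the study, and the column names and the example row are assumptions for illustration, not the ones in the published file.

```python
import csv
import io

# hypothetical column names; the published TSV may use different ones
FIELDS = ["timestamp", "replies", "gender", "country", "city",
          "group_id", "message_id", "user_id"]


def write_rows(rows, handle):
    """write a header line, then one tab-separated line per mined message"""
    writer = csv.DictWriter(handle, fieldnames=FIELDS, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


# made-up example row, not taken from the data set
example = {"timestamp": 1300000000, "replies": 2, "gender": "female",
           "country": "Portugal", "city": "Lisbon",
           "group_id": 1, "message_id": 10, "user_id": 100}

buffer = io.StringIO()
write_rows([example], buffer)
print(buffer.getvalue())
```

keeping the three ids in every row is what later makes it possible to tell a repeated message from a new one.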
data set choice
last minute groups are characterized by being focused on a city or country, so local groups were chosen from various global locations. the software was developed so that more groups can be added easily. in this case, 36 groups were chosen. after mining, 22 groups were left from 18 different countries. data from blocked groups was not gathered since it is not publicly available. 5 pages from each group were collected, totaling 1760 messages.
data analysis
once the TSV file was populated, it was loaded into a SQLite relational database with a primary key on the postid, so that duplicate posts were counted only once.
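the deduplication step can be sketched roughly as follows, using Python's standard `sqlite3` module instead of the original PHP script. the table name `posts`, the column names and the sample rows are assumptions; the only thing taken from the text is the primary key on the post id.

```python
import csv
import io
import sqlite3

# hypothetical schema; only the primary key on postid comes from the text
SCHEMA = """
CREATE TABLE IF NOT EXISTS posts (
    postid  INTEGER PRIMARY KEY,
    replies INTEGER,
    gender  TEXT,
    country TEXT,
    city    TEXT
)
"""


def load_tsv(handle, conn):
    """load TSV rows into the posts table and return the stored row count"""
    conn.execute(SCHEMA)
    reader = csv.DictReader(handle, delimiter="\t")
    rows = [(int(r["postid"]), int(r["replies"]),
             r["gender"], r["country"], r["city"]) for r in reader]
    with conn:  # commit all inserts as one transaction
        # OR IGNORE skips any row whose postid is already in the table
        conn.executemany(
            "INSERT OR IGNORE INTO posts VALUES (?, ?, ?, ?, ?)", rows)
    return conn.execute("SELECT COUNT(*) FROM posts").fetchone()[0]


# made-up sample in which post 10 was mined twice
sample = (
    "postid\treplies\tgender\tcountry\tcity\n"
    "10\t2\tfemale\tPortugal\tLisbon\n"
    "11\t0\tmale\tPortugal\tLisbon\n"
    "10\t2\tfemale\tPortugal\tLisbon\n"
)
conn = sqlite3.connect(":memory:")
stored = load_tsv(io.StringIO(sample), conn)
print(stored)  # three lines in, two rows stored
```

letting the database enforce uniqueness means the mining script can be rerun over the same pages without inflating the counts.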
the data was then grouped by gender, gender-country, and gender-country-city tuples, calculating the rates of reply, maximums and averages for each group. the results were then rendered into an HTML page that charts all of this information. note that an unknown gender can mean either that the user did not fill it out or that they chose not to make it publicly available.
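one plausible way to express this grouping is a single `GROUP BY` query per tuple. the sketch below is in Python with made-up rows and an assumed `posts` table. it reads "rate of reply" as the average number of replies per post, which is an interpretation and not something the text spells out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE posts (postid INTEGER PRIMARY KEY, replies INTEGER, "
    "gender TEXT, country TEXT, city TEXT)")
# made-up rows for illustration only
conn.executemany("INSERT INTO posts VALUES (?, ?, ?, ?, ?)", [
    (1, 2, "female", "A", "X"),
    (2, 4, "female", "A", "X"),
    (3, 1, "male", "A", "X"),
    (4, 0, "unknown", "B", "Y"),
])


def stats_by(conn, *columns):
    """return (group columns..., post count, average replies, max replies)"""
    # the column names are fixed strings from this sketch, never user input
    cols = ", ".join(columns)
    query = (f"SELECT {cols}, COUNT(*), AVG(replies), MAX(replies) "
             f"FROM posts GROUP BY {cols} ORDER BY {cols}")
    return conn.execute(query).fetchall()


by_gender = stats_by(conn, "gender")
by_gender_country = stats_by(conn, "gender", "country")
by_gender_country_city = stats_by(conn, "gender", "country", "city")
print(by_gender)
```

users with no public gender are kept as their own `unknown` group here, so they still contribute to the overall averages without being counted as male or female.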
for visualization, bar charts were drawn in which the full bar length corresponds to the maximum average response rate among all groups in a category, so that different groups or countries can be compared directly by visual inspection.
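the scaling behind such a chart is simple enough to sketch. the Python below is an illustration, not the original generator: the 300 pixel width, the `div` markup and the function names are all assumptions.

```python
import html


def bar_widths(averages, full_width=300):
    """scale each average so that the largest one fills full_width pixels"""
    largest = max(averages.values(), default=0)
    if largest == 0:
        return {label: 0 for label in averages}
    return {label: round(full_width * value / largest)
            for label, value in averages.items()}


def html_bars(averages, full_width=300):
    """render one div per category, with its width set by bar_widths"""
    lines = []
    for label, width in bar_widths(averages, full_width).items():
        lines.append(
            f'<div style="width:{width}px;background:#888">'
            f'{html.escape(label)}</div>')
    return "\n".join(lines)


# made-up averages for illustration only
widths = bar_widths({"female": 3.0, "male": 1.5, "unknown": 0.0})
print(widths)
print(html_bars({"female": 3.0, "male": 1.5}))
```

because every bar in a category is divided by the same maximum, two bars can be compared by length alone, but bars from different categories cannot.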
results
the results can be seen in this html page or embedded below if your browser allows it
discussion
world data
the world data was very level, indicating that there is little gender bias in general, though women do tend to have a slightly higher response rate than the general user. this indicates, though not strongly, that being female improves the response rate a user can expect on this network.
country and city data
the country and city data is perhaps the most interesting part of this study, as very strong contrasts are seen between different groups. for example, India showed 3 times the average response rate for females, versus 1.4 times the average response rate for males, indicating roughly twice the response rate for females compared to males. on the other hand, Luxembourg showed 3.33 times the world average for males, versus 1.81 times for females, signaling that in Luxembourg a male was almost twice as likely to get a reply as a female. in general, cultural differences between countries are very strong with regard to gender bias: some countries are very gender-sensitive, while others are not significantly so.
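the "twice" in both examples is just the ratio of the two multiples. as a quick check in Python, reading the Luxembourg figures as 3.33 times the world average for males and 1.81 times for females (an assumption about how that sentence should be parsed):

```python
# multiples of the world average response rate quoted in the text
india = {"female": 3.0, "male": 1.4}
# assumption: 3.33 is the male multiple and 1.81 the female one
luxembourg = {"male": 3.33, "female": 1.81}


def ratio(rates, favored, other):
    """how many times higher the favored gender's rate is than the other's"""
    return rates[favored] / rates[other]


india_ratio = round(ratio(india, "female", "male"), 2)
luxembourg_ratio = round(ratio(luxembourg, "male", "female"), 2)
print(india_ratio)       # 2.14
print(luxembourg_ratio)  # 1.84
```

so India comes out slightly above two to one in favor of females and Luxembourg slightly below two to one in favor of males.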
conclusions
at the global level, this agrees with what is frequently said about gender bias: that females are favored by their gender in finding a place to stay. however, once the data is split regionally, this no longer holds. the culturally different attitudes towards gender seem to be stronger than common sense would suggest, disproving the hypothesis that females are always favored by their gender. females are favored in some cases and not in others, which is sufficient to disprove that there is widespread gender bias on CS. it exists, but it is confined to specific geographical locations, in wildly varying amounts, and globally it is not very significant at all.
note that this data is biased by the fact that we only analyze people who couldn’t find a couch and are using last minute groups, so it cannot by itself be used to generalize further, though it is already indicative of no significant global gender bias.
sources for replication
i provide all sources and the collected database for free download, as long as the license of this website is respected. all scripts start with a shebang line so that they can be run from a shell. if your php-cli is located elsewhere, you should change that line. the code is not commented, since it is too small to be complicated, but if anyone needs assistance just comment below.
- [SQLite relational database file](http://ubuntuone.com/p/raR/)
- [TSV file of the data mined](http://ubuntuone.com/p/raS/) (works with excel/open office)
- [data mining PHP script](http://ubuntuone.com/p/raU/) (includes a simple way to add more groups. it randomizes the request interval to avoid tripping firewalls and to avoid causing server problems)
- [TSV to database script](http://ubuntuone.com/p/raa/) (converts data from one to the other, effectively filling up the database)
- [HTML code generator](http://ubuntuone.com/p/rab/) (generates the html graphs from the data present in the database)
- [HTML output example](http://ubuntuone.com/p/rac/) (the output of the software for the data in this study)
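for readers who do not want to open the PHP script, the randomized request interval mentioned above can be sketched like this in Python. the 5 second base, the 3 second jitter and the `fetch` callable are assumptions for illustration, not values taken from the original script.

```python
import random
import time


def polite_delay(base=5.0, jitter=3.0, rng=random):
    """return a wait of base seconds plus a random extra of up to jitter"""
    return base + rng.uniform(0.0, jitter)


def crawl(urls, fetch, sleep=time.sleep):
    """fetch each url in turn, pausing a random interval between requests"""
    pages = []
    for index, url in enumerate(urls):
        if index > 0:  # no need to wait before the very first request
            sleep(polite_delay())
        pages.append(fetch(url))
    return pages


# stand-in fetch and sleep so the sketch runs without touching the network
waits = []
pages = crawl(["a", "b", "c"], fetch=lambda url: url.upper(),
              sleep=waits.append)
print(pages, len(waits))
```

passing `fetch` and `sleep` in as arguments keeps the pacing logic testable without hitting the real site or actually waiting.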
comment
it is a bit unsettling to me that all this data is made publicly available by CS without any control. this has to do with the default privacy setting: people share everything with the world unless they choose not to. this means that most people who aren’t particularly tech savvy will end up sharing more than they expect. sharing everything by default is becoming a trend among websites, which, in my opinion, sets a dangerous precedent. not everyone is like me, doing this to test scientific hypotheses. this information can easily be exploited commercially with some 30 minutes of coding like mine.
this study also demonstrates how informatics can be helpful in the social sciences, and that with a little bit of coding one can get huge datasets automatically, ready for processing. it took me 30 minutes of coding to set up the mining, i left it overnight to crawl the website, and it then took about 1h to set up the visualization. it would be interesting to see this data on a map, but for the sake of my free time, i’m not going to do it.