philosopher bagpiper

gender bias on hospitality networks: a case study

some gaita sanabresa. today this post is shaped like a scientific article, well, because it is. if you want to see the results first, check the link in the abstract. apparently, CS is not that misogynous after all.

abstract

much has been said about gender bias in hospitality networks, and most of it has been speculation. it has frequently been said that males on these networks exploit them as a way to meet females. therefore, i hereby present a quantitative analysis of a specific case study: the rate of replies to males and females on last minute groups on the website CouchSurfing. the final charts can be seen on this page.

methodology

all data was collected from publicly available pages, much like a search engine would do it. in the cases where users choose to make their data private, or groups choose to require a login, no data is collected.

the data was collected using a data mining technique, with PHP as the main language. it was then dumped into a TSV file that contained the timestamp of collection, the number of replies, the gender of the person sending the message, and the country and city of the group. group id, message id and user id were also collected for consistency.
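as a rough illustration of the TSV layout described above, here is a minimal sketch. it is written in Python rather than the PHP of the original script, and the column names and the sample row are assumptions made up for the example, not taken from the real file.

```python
import csv
import io

# Column names are assumptions based on the fields listed in the text.
FIELDS = ["timestamp", "replies", "gender", "country", "city",
          "group_id", "message_id", "user_id"]


def write_rows(rows, handle):
    """Write an iterable of dicts as tab-separated lines with a header row."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


# Made-up example row, not part of the real data set.
sample = [{"timestamp": 1300000000, "replies": 2, "gender": "female",
           "country": "Portugal", "city": "Lisbon",
           "group_id": 1, "message_id": 10, "user_id": 100}]

buffer = io.StringIO()
write_rows(sample, buffer)
tsv_text = buffer.getvalue()
```

a file written this way opens directly in a spreadsheet, which matches how the TSV is offered for download further down.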

data set choice

last minute groups are characterized by being focused on a city or country, so local groups were chosen from various global locations. the software was developed so that more groups can be added easily. in this case, 36 groups were chosen. after mining, 22 groups were left from 18 different countries. data from blocked groups was not gathered since it is not publicly available. 5 pages from each group were collected, totaling 1760 messages.

data analysis

once the TSV file was populated, it was then fed to a SQLite relational database, with a primary key on the postid, so that duplicate posts were not taken into account.
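the duplicate handling can be sketched like this, in Python with the standard `sqlite3` module instead of the original PHP. the table name `posts`, the column names and the sample rows are assumptions for illustration only.

```python
import sqlite3


def load_posts(conn, rows):
    """Insert (postid, gender, country, city, replies) tuples, skipping duplicates."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts ("
        " postid INTEGER PRIMARY KEY,"
        " gender TEXT, country TEXT, city TEXT, replies INTEGER)"
    )
    # INSERT OR IGNORE silently drops any row whose postid already exists,
    # so a post seen twice during crawling is only counted once.
    conn.executemany("INSERT OR IGNORE INTO posts VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
# Made-up rows; the third one repeats postid 1 on purpose.
rows = [
    (1, "female", "India", "Delhi", 3),
    (2, "male", "India", "Delhi", 1),
    (1, "female", "India", "Delhi", 3),
]
load_posts(conn, rows)
post_count = conn.execute("SELECT COUNT(*) FROM posts").fetchone()[0]
```

the primary key does the deduplication, so the loading script does not need to remember which posts it has already seen.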

the data was then grouped by gender, gender-country, and gender-country-city tuples, calculating the rates of reply, maximums and averages. the data was then processed into an HTML page that charts all of this information. note that an unknown gender can mean either that the user did not fill it out or that they chose not to make it public.
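the grouping step could look roughly like the sketch below. this is Python with `sqlite3`, not the original PHP generator, and the `posts` table, its columns and the rows are all hypothetical.

```python
import sqlite3

ALLOWED_KEYS = {"gender", "country", "city"}


def reply_stats(conn, *keys):
    """Return {key tuple: (average replies, maximum replies)} for the given columns."""
    if not keys or not set(keys) <= ALLOWED_KEYS:
        raise ValueError("keys must be drawn from gender, country, city")
    cols = ", ".join(keys)  # safe to interpolate: names are checked above
    query = (f"SELECT {cols}, AVG(replies), MAX(replies) "
             f"FROM posts GROUP BY {cols} ORDER BY {cols}")
    return {tuple(row[:-2]): (row[-2], row[-1]) for row in conn.execute(query)}


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (postid INTEGER PRIMARY KEY, gender TEXT,"
             " country TEXT, city TEXT, replies INTEGER)")
# Made-up rows, only there to exercise the query.
conn.executemany("INSERT INTO posts VALUES (?, ?, ?, ?, ?)", [
    (1, "female", "India", "Delhi", 4),
    (2, "female", "India", "Delhi", 2),
    (3, "male", "India", "Delhi", 1),
    (4, "male", "Luxembourg", "Luxembourg", 3),
])

by_gender = reply_stats(conn, "gender")
by_gender_country = reply_stats(conn, "gender", "country")
```

calling `reply_stats(conn, "gender", "country", "city")` gives the third level of grouping in the same way.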

for visualization, a bar chart was drawn with the maximum bar length set to the maximum average response rate of all groups in each category, so that different groups or countries can be compared directly by visual inspection.
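the scaling rule can be sketched as follows, again in Python rather than the original PHP. the function names, the 300 pixel width and the sample averages are assumptions; the real generator may draw its bars differently.

```python
import html


def bar_widths(averages, full_width=300):
    """Map each label's average to a pixel width, with the largest average at full_width."""
    top = max(averages.values(), default=0)
    if top <= 0:
        return {label: 0 for label in averages}
    return {label: round(full_width * value / top)
            for label, value in averages.items()}


def html_bars(averages, full_width=300):
    """Render one <div> per label containing a bar scaled by bar_widths."""
    rows = []
    for label, width in bar_widths(averages, full_width).items():
        rows.append(
            f'<div>{html.escape(label)} <span style="display:inline-block;'
            f'background:#888;height:1em;width:{width}px"></span></div>'
        )
    return "\n".join(rows)


# Made-up averages for illustration.
widths = bar_widths({"female": 3.0, "male": 1.5, "unknown": 0.75})
page = html_bars({"female": 3.0, "male": 1.5, "unknown": 0.75})
```

because every bar is divided by the same maximum, bars within one category are comparable to each other at a glance.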

results

the results can be seen in this html page or embedded below if your browser allows it

discussion

world data

the world data was very level, indicating that there is little gender bias in general, though women do tend to have a slightly higher response rate than the general user. this indicates, though not strongly, that being female favors the response rate a user can expect on this network.

country and city data

the country and city data is perhaps the most interesting part of this study, as very strong contrasts are seen between different groups. for example, India showed 3 times the average response rate for females, versus 1.4 times the average response rate for males, indicating roughly twice the response rate for females compared with males. on the other hand, Luxembourg showed 3.33 times the world average for males versus 1.81 times for females, signaling that in Luxembourg a male was almost twice as likely to get a reply as a female. in general, cultural differences in gender bias are very strong between countries: some countries are very gender-sensitive, while others are not significantly so.
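the "twice" figures can be checked with a quick calculation. the sketch below assumes that all four numbers are multiples of the world average response rate; for Luxembourg that reading is an interpretation of the text, not something the data file was consulted for.

```python
# Multiples of the world average response rate, as quoted in the text.
india_female, india_male = 3.0, 1.4
# Assumed reading: both Luxembourg figures are also multiples of the world average.
lux_male, lux_female = 3.33, 1.81

india_ratio = india_female / india_male  # females relative to males
lux_ratio = lux_male / lux_female        # males relative to females

print(f"India: {india_ratio:.2f}x in favor of females")
print(f"Luxembourg: {lux_ratio:.2f}x in favor of males")
```

under that assumption the ratios come out at about 2.14 and 1.84, so "twice" is a rounding in both cases.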

conclusions

this demonstrates what is frequently said about gender bias in a global perspective, that females are favored by their gender in finding a place to stay. however, if the data is split regionally, this no longer adds up. the culturally different attitudes towards gender seem to be stronger than what common sense would claim, disproving the hypothesis that females are always favored by their gender. females are favored by their gender in some cases, in others not, so this is sufficient to disprove that there is widespread gender bias on CS. it exists, but it is confined to specific geographical locations, with wildly varying amounts, and not in a very significant way at all.

note that this data is biased by the fact that we only analyze people that couldn’t find a couch and are using last minute groups, which in itself cannot be used to generalize further, though it is already indicative of no significant global gender bias.

sources for replication

i provide all sources and the database collected for download freely, as long as the license of this website is respected. all scripts are prefixed with a shebang line so that they can be used in a shell environment. if your php-cli is located elsewhere, you should change that line. code is not commented, it is too small to be complicated, but if anyone needs assistance just comment below.

[SQLite relational database file](http://ubuntuone.com/p/raR/)
[TSV file of the data mined](http://ubuntuone.com/p/raS/) (works with excel/open office)
[data mining PHP script](http://ubuntuone.com/p/raU/) (includes a simple way to add more groups. it randomizes the request interval to avoid firewalls and to avoid causing server problems)
[TSV to database script](http://ubuntuone.com/p/raa/) (converts data from one to the other, effectively filling up the database)
[HTML code generator](http://ubuntuone.com/p/rab/) (generates the html graphs from the data present in the database)
[HTML output example](http://ubuntuone.com/p/rac/) (the output of the software for the data in this study)
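for readers who want the idea of the randomized request interval without opening the PHP script, here is a minimal sketch in Python. the function names and the 2 to 8 second range are assumptions; the actual script may use different bounds and a different structure.

```python
import random
import time


def polite_delay(min_seconds=2.0, max_seconds=8.0, rng=random):
    """Pick a random pause length, in seconds, to wait between two requests."""
    if min_seconds < 0 or max_seconds < min_seconds:
        raise ValueError("need 0 <= min_seconds <= max_seconds")
    return rng.uniform(min_seconds, max_seconds)


def crawl(urls, fetch, sleep=time.sleep, rng=random):
    """Call fetch(url) for each URL, pausing a random interval between calls."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # No pause before the first request, a random one before the rest.
            sleep(polite_delay(rng=rng))
        results.append(fetch(url))
    return results
```

`fetch` and `sleep` are passed in so the loop can be tried out without touching the network, for example `crawl(["a", "b"], fetch=str.upper, sleep=lambda s: None)`.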

comment

it is a bit unsettling to me that all this data is made publicly available by CS without any control. this has to do with the default privacy setting: people share everything with the world unless they choose not to. this means that most people that aren’t particularly tech savvy will end up sharing more than they would expect. this is becoming a trend in online websites, share everything by default, which, in my opinion, sets a dangerous precedent. not everyone is like me and is doing this to test out scientific hypotheses. this information can be easily exploited commercially, with some 30 mins of coding like i did.

this study also demonstrates how informatics can be helpful in social sciences, and that with a little bit of coding one can get huge datasets automatically, ready for processing. it took me 30 minutes of coding to set up the mining, i left it overnight to crawl the website, and then it took about 1h to set up the visualization. it would be interesting to see this data on a map, but for the sake of my free time, i’m not going to do it.

bagpipe festival this saturday

just a short notice. this saturday is the FEST-i-GAITA 2011 bagpipe festival in lisbon. there will be workshops and documentaries during the day and concerts at night. me and several other students of the bagpipe school will perform to fill in between sets. the video is of the band of the association i’m part of. the line up includes Remi Decker, Volta e Meia and Roncos do Diabo

minds as information machines, part 2

some asturian gaita again, this time a pipe band

in part 1 we first discussed the broad definition of observation and mind. today we’ll continue on to more complex structures. to summarize, the first observations are the physical patterns due to reality’s laws of nature (the mindless observers of reality), and the first minds are the ones that change their physical structure by using the laws of nature as the feedback loop for their structure.

as we identify more and more complex structures, the pattern is the same: represented information and replicated information through work on itself/others. this work can be done as a consequence of the structure itself. take, for example, a cell, whose work is basically chemical and thermodynamical with no “brain” controlling it (we’ll define this soon). we can say that the regulation processes that allow this to happen exist as the consequence of an evolutionary thought that led there. note that this definition of “thought” is simpler than usual: it is the processing of information through work. it makes no distinction whether this processing is done using forces outside of the system (laws of nature, for example) or internal forces (the biased expression of internal energy of the system). so a cell thinks through evolution and this allows it to adapt.

as these structures develop growing internal information (quantifiable as the total information of its constituents, which is the information of the arranged molecules), they can interact with similar, work capable, structures. i’m not adding a definition, but merely identifying another feedback that can happen: not only laws of nature apply, but also other things can do work to change information in other things. the chemical communication between cells is an excellent example of this and has incredible effects. if the work loop can be done with “faster” laws, i.e., if information can be processed faster than through evolutionary time, then thoughts can occur more frequently and evolution can move faster. this is evident in single cell organisms that react to their environment and that leave chemical cues to their partners. sexual reproduction is another good example of the first multiple agent thinking. sex is effectively a non-conservative transformation of information: we feed a and b in, and we get a part of a and a part of b mixed in an unpredictable way. this is another type of information processing.
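the "a and b in, unpredictable mix out" idea can be made concrete with a toy sketch in Python. the function name and the string encoding of the parents are assumptions made for the example; this is not a model of real genetics.

```python
import random


def recombine(parent_a, parent_b, rng=random):
    """Build a child by picking each position from one parent or the other at random."""
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must have the same length")
    return "".join(rng.choice(pair) for pair in zip(parent_a, parent_b))


# Two made-up parents, chosen so it is obvious where each letter came from.
child = recombine("AAAAAAAA", "BBBBBBBB", random.Random(1))
```

the child keeps only part of each parent, and the parts that were not picked are gone from it, which is the sense in which the transformation is non-conservative.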

in general, and layer blindly (this is very important), i will therefore give the following definitions:

an observation is a non-random arrangement of a subset of basic elements of reality (as a written description of a sunset is to the random arrangement of the letters it was written with);
a thought is a transformation of information from one state to another (this information can be internal or external and transform it into internal and external information, in any order);
a mind is anything capable of producing work that creates thoughts and/or observations

or, in mathematics just to annoy everyone again, let reality R be the set of all existing elements (does it include itself? not going there, this is an axiom, remember?).

observation: $O \subseteq R$ such that $I(O) > I(O_e)$, where $O_e$ indicates the elements of O and I is information (that arrangement is less likely than many random ones of its elements);
thought: $f: (O,R) \rightarrow (O',R')$, a function f that takes internal observations O and reality R (note that O is inside R, making the distinction unnecessary, but used for clarity), and generates a new arrangement of O and R, O’ and R’ (note that this kind of definition is applicable to almost any physical system that does a work cycle);
mind: the set $M = O \cup f$, where O is the observation and f is its thought function.

now, this is a bit idiotic again. i am using an abstraction to define O and R as separate things, but it should be obvious that one is a partial copy of the other. there is only R, and O is in it. the only apparently special thing is the f: it is hard to understand why it is there. we can say either that it is part of R or that it belongs to some other kind of medium outside R. in my case, i prefer to say these f’s are just R’s constituents interacting and affecting one another with their own properties. this, obviously, could be a lengthy discussion, so i’ll leave it there. O and f are subsets of R.

now that we formalized thoughts, minds and observation as simpler things, we can understand them in a broader sense. i will tackle bigger and bigger structures in the next parts.

minds as information machines, part 1

more portuguese gaita. as promised a long time ago, we will begin exploring the implications of this information model in minds. we will start with simple, brainless, minds. note that this means that my definition of mind is a bit broader than usual.

previously we saw how structure is a property of the arrangement of things. this structure can be quantified using information theory, which actually measures a quantity similar to entropy. we will avoid adding wholes to our parts, since after what i explained previously, that would create issues with infinite information quantities. the information of a whole is the information of its parts. we will consider this as the base principle from now on.

we discussed how complexity can be quantified, but we didn’t discuss how it can be created. this stretches back to some of my early posts. through work (in the physical sense), we can increase the structure of things, provided that this work is fed by some external energy source. gravity for example, during transitory astronomical stages (like the accretion period of planet formation), clumps these things together into more specific arrangements of things. it is arguable whether this is the first case of work or if it is just a property of reality. but in practice, besides clumping things together, it increases the information of a given region in space, versus every other. for example, in the volume of the solar system, information is present in high density areas (planets, sun) and low to no density (empty space). some excellent questions pop up, such as dark matter and so on. all principles are consistent if instead of using matter, we use some other, lower level, organization quantity. for the sake of the argument, it is irrelevant whether dark matter exists or not (but not for the absolute quantities of information).

we can hardly call gravity a mind, or our planet a mind, but it is an example of structured matter that tends to become more and more structured, and by analyzing its structure, we can know factors of the external reality. for example, if a planet could think (which it can’t), it could tell that its heavier bits were more to its center, and that its lighter bits were more to its edge. this implies that there is something that causes these differences, and that the planet actually represents information about its reality. i.e., the matter of the planet is affected by external things (e.g., gravity, electromagnetism), and this causes its shape to change, representing the consequence of these external forces. this means that a planet is a crude, but subjective, observer of its reality. why subjective? because different sections of space have different elements, and planets cannot observe (incorporate) elements that do not exist within its gravitational pull. now, it does not process it, i.e., doesn’t do work on its own structure (e.g., a planet doesn’t suddenly turn all its iron and nickel into hydrogen out of free will), yet the reality around it has consequences on it and these define its own information, versus a random arrangement. so arranged matter is a mindless observer of reality, in the sense that it only collects information about reality (the information collected is its own particular arrangement), but does not act on it (does not do work cycles to change this information).

for example, three things emerge from elements a and b, aa, ab and bb. we know that ab, thanks to the electrical force, will be able to remain together. we also know that aa and bb can’t stay together for long in their environment for the same reason. this means that in the next nearest moment, it is more likely to find the arrangement ab than aa or bb. reality has shaped the structure of these things by virtue of its own laws, and by consequence, ab not only exists, but any other group doesn’t. this narrowed (or structured) the things themselves into a more specific arrangement. it also means that ab has in it an observation of the laws of reality around it, it is a mindless observer of reality.

this brings us to the simplest, and the first, mindful observers of reality. the difference between a mindless observer and a mindful observer is that the latter can do work to change its structure or its environment’s, versus being passively changed by reality. i will start with self-replicating molecules. a self-replicating molecule has both a particular structure and the structure that causes it, thanks to external reality, to replicate. i.e., it is an observer (collects information from reality in the form of its particular constituents and their positions), and it is an agent (by being immersed in an environment it is capable of affecting its structure and the structure of things around it). this implies that its actions have a prior knowledge of reality and how to affect it. by simply copying itself and making mistakes, a molecule will optimize its structure versus its environment for the simple fact that the ones that don’t optimize their structure versus their environment won’t be able to copy themselves. these primitive minds don’t think, thinking is the act of processing internal and external information into different internal and external information. in this case, the thinking is done by the laws of nature. this might seem confusing, but let’s see an example.

two things, c and d, are immersed in reality and made of a and b. thing c, thanks to its molecular structure, can, through the physical interactions occurring around it, take its two constituents from the environment and cause them to turn into another c. thing d cannot. start with a “bath” of many a’s and b’s, and one c and one d. as a’s and b’s bump into each other and into c and d, whenever c, a and b are together, another c is formed. no such thing happens with d. whenever a c is formed, an a and a b are consumed. so our soup of letters will soon have many c’s and only one d. if there is any chance of d breaking down into a’s and b’s (dying), it will again be more likely for them to become a c than a d. what we see here is the laws of nature doing the work that represents the thought. this simple thought is no more than the information required to process information flowing from reality and back: a and b come together close to c, another c emerges. this implies that c not only has information (a and b), but also changes information around it (causes other a’s and b’s to turn into c’s). since it is not capable of doing this on its own, the thoughts are carried by the forces of nature. but this simple thought could be written as “if a and b are close to me they will become c”, and it occurs whenever an a and a b touch a c and turn into another c. i separate knowledge (internal information) and thought (work done on information) from each other because a self-replicator might not need all its information to do work. for example, it may be that only b causes c to appear, but since a is required to make a c even though it doesn’t contribute to its replication, it gets copied too, i.e., b does all the work, but needs an a to make a c.
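the soup can be played out as a toy simulation. the sketch below is Python, and the starting counts, the number of steps and the rule for which thing gets bumped into are all assumptions made to keep it short; it also leaves out the case of d breaking down.

```python
import random


def simulate(a=50, b=50, c=1, d=1, steps=200, rng=None):
    """Run a toy soup where c + a + b -> 2c and d never copies itself."""
    rng = rng or random.Random()
    for _ in range(steps):
        if a == 0 or b == 0:
            break  # no raw material left
        # A free a and b bump into one of the existing c's or d's,
        # with chances proportional to how many of each there are.
        bumped = rng.choices(["c", "d"], weights=[c, d])[0]
        if bumped == "c":
            a -= 1
            b -= 1
            c += 1
        # Bumping into d changes nothing: d cannot replicate.
    return {"a": a, "b": b, "c": c, "d": d}


soup = simulate(rng=random.Random(42))
print(soup)
```

every new c makes the next encounter with a c more likely, so the count of c runs away while d stays at one, and nothing in the code "decides" this: the rule of the soup does the work.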

this demonstrates the first working mind, using reality as the carrier of its thoughts. i’ll give a slightly more elaborate example, one that i referred to previously. a sunflower has in it the information required to make a sunflower, and its structure interacting with the environment causes it to replicate; we saw above how this works. genes and cells are like the above example: they use time as the extra dimension for their thought process. but the sunflower also has a solar-tracking feature that i want to use as an example. does the fact that a sunflower tracks the sun mean it “knows” where the sun is? according to my definition of minds, yes! the sunflower: a) has information about the world around it; b) does work according to that information, on itself and/or on the world around it.

as an argument for a), consider that an alien from a starless planet could use the sunflower as a way to know what a star is by simply analyzing the bit of its constituents that reacts to sunlight and makes it grow faster. the alien could induce that the plant was in an environment where sunlight existed, even though he never saw a star. and though the alien might induce an incorrect description of the sun observed by the sunflower, he could do better than guessing.

as an argument for b), consider that the sunflower cells grow faster on the areas excited by the sun, making it turn. now, it turns because these areas grow faster when in sunlight, but the reason why they grow faster is that, by thinking using evolution, the plants that turned did better than the ones that didn’t. this thinking was done over many iterations of its structure until it reached this point, where the implicit understanding that the sun moves can be induced from the explicit motion of the plant. if a plant didn’t understand the sun and its motion, it could not turn accordingly. now, it doesn’t fully understand the sun (neither do we), since it is still subject to, for example, being fooled by artificial human lights. but we have to understand that evolutionary thoughts take thousands of generations to reach conclusions. so it would be like learning how to read in english and then being given a transliterated japanese text and saying “but they are the same letters”. the letters here are light, english is the sunlight, and japanese is the artificial light. since through evolutionary thought the sunflower only learned english, it won’t learn japanese instantly. but if given long enough, it might.

i know that observer planets and thinking plants and molecules sounds a bit exotic and silly. so i’ll finish for now. we are not dealing with elaborate thoughts. in fact, if you take the sunflower example, its thoughts would be something like “sun is here” “sun is there” “sun is nowhere”. not very elaborate thoughts, but they are proto-thoughts nevertheless, that themselves require some subjective internal representation of the world and action according to this interpretation. my opinion is that by broadening the definition of thought and mind, it might be easier to understand more complex structures. we’ll do that in the coming parts.

future stats and studies

asturian and galician gaita. CS means couchsurfing

recently CS changed their search algorithm. almost instantly i went from a couple of requests a day (1 to 3) to zero. in fact, now we (me and T) get fewer than 5 a week between the two of us. we can only explain this by the changes in the algorithm. what this means is that i can’t generate enough data to be statistically significant; therefore, i’ve concluded my studies on requests and stays. what remains is data analysis of the past (which is already a huge dataset).

this is also a reflection on the current state of CS. the quality of its members is decaying in inverse proportion to the number of members. we now live in a much nicer place and we get more people creeped out than we did in the previous two (yes, that includes the dog-shit-everywhere squat). with this change in algorithms, it’s interesting to see what will happen.

previously, there was a positive feedback effect on being a good host or guest: you’d get listed above and with it, you’d get more requests and/or more hosts. this meant that everywhere you’d find nodes of CS where you have few, very passionate and active members, with an enormous quantity of references and experience.

with the current algorithm, as far as i could investigate, they leveled the field for everyone. i agree with the idea behind it: allowing everyone to be able to host and surf as easily as everyone else. but my guess is that this will bring the average stay quality down, just by exposing guests to everyone, rather than “professional” hosts like top hosts usually are. they replaced a meritocracy with a democracy.

i expect to be slowly (and naturally) marginalized as time moves forward with this algorithm, since lisbon now has over 2000 people registered, making me a 1/2000 voter in a population of 2000, regardless of the fact that i’m among the top 50 hosts in the world right now (among 2000000+ people, i was the 14th most experienced, considering data from today). my contribution to this community will slowly be eroded as time goes by. i have ambivalent feelings towards this that haven’t matured yet, so i don’t really know how i feel about it. i guess it’s good to kick out the embittered hosts, since all hosts turn bitter at some point after too many guests (any good data on this?).

i also noticed a bias towards people that choose to make their personal information public. people that show off everything about themselves to the world (including google), will get listed over other people. this is an interesting tweak that probably offsets some of the effect i described above. CS has always capitalized on ego and self promotion, so maybe this shift will upset some users, but actually make the website more usable. we’ll see. but for now, no more experiments on request rate and so on.
