We report here on the first genome-wide association study (GWAS) in families with multiple members with schizophrenia. Significant associations of single-nucleotide polymorphisms (SNPs) can suggest new disease susceptibility mechanisms. For schizophrenia, large GWAS analyses of common SNPs have found associations in the major histocompatibility complex (MHC, chromosome 6) (1–3) and several specific genes (3–5). The Psychiatric GWAS Consortium (PGC) analyzed 21,856 individuals from 17 GWAS samples and then added data from an additional 29,839 individuals (including the present data set) for the most promising findings. The results strongly supported association in seven genes or regions between genes, including the MHC (6). The present study was designed before the typical effect sizes of common SNPs on disease risks became clear (e.g., odds ratios of only 1.1–1.2, conferring a 10%–20% increase in risk), and our sample is now known to be underpowered. However, we can address whether SNPs with larger effects might be “enriched” in families with multiple cases.
The PGC analysis (6) also confirmed a previous finding (1) that is interpreted as suggesting a polygenic effect of many common SNPs on schizophrenia susceptibility, based on the ability of association test results for many SNPs in one data set to predict case versus control status in a second data set. In the present study, we evaluated whether common risk SNPs in multiply affected families are likely to overlap with those in unrelated cases by testing whether our family study results can predict case-control status in the large PGC data set. We also explored whether any known functional gene pathways are enriched for modestly significant SNP associations. In single-SNP, polygenic, and pathway analyses, family data provide some protection against spurious associations due to case-control differences in ancestral backgrounds, because counts of SNP alleles that are transmitted from parents to ill offspring are contrasted with counts of the alleles that parents did not transmit.
GWAS analyses have also shown that rare chromosomal deletions of chromosomes 1q21.1, 15q13.3, and 22q11.2 and of exons of the neurexin-1 gene (NRXN1) and duplications of 16p11.2 (collectively present in around 1.25% of cases) each produce significant eightfold or greater increases in risk; notably, each has also been reported in autism, mental retardation, and epilepsy (7). We determined the frequency of these copy number variants (CNVs) in our families and examined how well they correlate (segregate) with disease in families, which has implications for diagnostic testing. We also identified new “candidate” CNVs.
The sample (Table 1) includes seven subsamples that were recruited for linkage studies (8–15) and subsequently combined (16–19), excluding families from the National Institute of Mental Health’s Schizophrenia Genetics Initiative because a previous GWAS studied the probands (2). Briefly, family members gave informed consent and were diagnosed using semistructured interviews, psychiatric records, and informant reports. Case subjects had DSM-III-R diagnoses of schizophrenia or schizoaffective disorder (probands had schizophrenia), which cosegregate in families (20) and are difficult to differentiate reliably (21). These families were originally ascertained because the constellation of affected relatives was informative for linkage studies, and all families had at least two directly evaluated narrow-diagnosis cases. For some families, only one affected case subject was included in this analysis, either because there was only one case subject in the nuclear family who met inclusion criteria or because DNA was not available for GWAS genotyping or the specimen failed quality control filters. Families were analyzed here if they had DNA available for one affected offspring plus one or both parents, for two affected siblings and at least one parent or one unaffected sibling, or for three or more affected siblings. Some families included more than one sibship that met these criteria. Based on an analysis of power versus cost (not shown), we included all available parents plus two unaffected siblings (if available) if no parents were genotyped, or one unaffected sibling if one parent was genotyped.
Genotyping, SNP Quality Control, and Genotypic Ancestry
Genotyping was performed with the Illumina 610-Quad array (at Illumina, Inc., La Jolla, Calif., for families and at the Children’s Hospital of Philadelphia [by H.H.] for control subjects; see p. 18 of the online data supplement for discussion of the CNV case-control analysis), and genotypes were called with the BeadStudio software package (Illumina, Inc.). HG18 genomic locations are reported. Based on principal components analysis (22) of 55,010 autosomal SNPs with low pairwise linkage disequilibrium (LD), families were divided into six ancestry groups (Table 1; see also Figure S1 in the data supplement that accompanies the online edition of this article): European, Mediterranean (primarily Sephardic Jewish), and four with varying degrees of African or South Indian admixture (Réunion Island). Because somewhat different genetic architecture has been observed for schizophrenia in European- and African-origin samples in previous single-SNP (2) and polygenic (1) GWAS results, separate analyses were carried out for the European-ancestry group and for the six ancestry groups combined.
Exclusion criteria for SNPs were as follows: third allele observed; pseudo-autosomal or mitochondrial; minor allele frequency <1% (in European-ancestry group or all founders); call rate <98.8%; p<0.0001 for deviation from Hardy-Weinberg expectation (in unrelated unaffected individuals); GenCall10 quality score <0.55; and more than four Mendelian inconsistencies for parent-child pairs and more than seven for parent-parent-child trios. Genotypes were removed for the family for SNPs with Mendelian inconsistencies and for males for chromosome X SNPs called as heterozygous. There were 576,976 autosomal and 15,146 chromosome X SNPs before quality control analysis (QC), and 531,195/12,936 for European-ancestry and 528,297/13,202 for all analyses after QC.
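The SNP exclusion criteria above can be sketched as a simple per-SNP filter. This is a minimal illustration, not the pipeline used in the study: the `SnpStats` record and its field names are hypothetical, and it assumes that exceeding either Mendelian-inconsistency threshold (pairs or trios) is sufficient for exclusion.

```python
from dataclasses import dataclass


@dataclass
class SnpStats:
    """Per-SNP summary statistics (hypothetical field names)."""
    n_alleles: int                  # distinct alleles observed
    pseudo_autosomal_or_mito: bool  # pseudo-autosomal or mitochondrial SNP
    maf: float                      # minor allele frequency in the analysis group
    call_rate: float                # fraction of samples with a genotype call
    hwe_p: float                    # Hardy-Weinberg p value (unrelated unaffected)
    gencall10: float                # GenCall10 quality score
    pair_mendel_errors: int         # parent-child inconsistencies
    trio_mendel_errors: int         # parent-parent-child inconsistencies


def snp_exclusion_reasons(s: SnpStats) -> list:
    """Return the list of criteria a SNP fails (empty list means it passes)."""
    reasons = []
    if s.n_alleles > 2:
        reasons.append("third allele observed")
    if s.pseudo_autosomal_or_mito:
        reasons.append("pseudo-autosomal or mitochondrial")
    if s.maf < 0.01:
        reasons.append("minor allele frequency <1%")
    if s.call_rate < 0.988:
        reasons.append("call rate <98.8%")
    if s.hwe_p < 0.0001:
        reasons.append("Hardy-Weinberg p<0.0001")
    if s.gencall10 < 0.55:
        reasons.append("GenCall10 <0.55")
    # Assumption: either threshold alone triggers exclusion.
    if s.pair_mendel_errors > 4 or s.trio_mendel_errors > 7:
        reasons.append("Mendelian inconsistencies")
    return reasons
```

Returning the failed criteria rather than a single pass/fail flag makes it easy to tabulate how many SNPs each threshold removes.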
DNA sample exclusion criteria were as follows: duplicates of another sample; genotypically inconsistent with known gender or family structure; >104 parent-child or >199 parent-parent-child Mendelian inconsistencies; call rate <98%; or mean heterozygosity inconsistent with ancestry subgroup. Chromosome X data were excluded if genotypic gender was ambiguous (possible cell culture artifact) but autosomal QC was acceptable.
Statistical Analyses of Genetic Association to SNPs
Family-based association tests were performed using TRANSMIT, version 2.5.4 (23), for autosomal SNPs. TRANSMIT was selected because it is fast and can handle any constellation of genotyped relatives. However, it is not recommended for chromosome X, so UNPHASED, version 3.1.5 (24), modified for consistency with TRANSMIT in handling ungenotyped individuals, was used for that chromosome. These programs test whether each SNP allele is transmitted more or less often than chance expectation. Because they use data set allele frequencies as well as the family’s data to estimate nontransmitted alleles of ungenotyped parents, analyses were performed separately for each of the six ancestry subgroups. European-ancestry and all-family results are reported (with the latter combining observed and expected transmission counts across groups). Autosomal odds ratios were estimated by subtracting an estimate of the number of homozygous parents (allele frequency squared, times the number of parents) from the total number of transmissions of each allele to obtain transmissions from heterozygous parents (expected to be 50% for each allele by chance), and computing the ratio of counts for the two alleles. Genomic control lambda was computed as the median chi-square value divided by the expected value (0.456).
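The odds ratio estimate and the genomic control lambda described above can be written out as a short sketch. It follows a literal reading of the text and is not the code used in the study; the function and argument names are hypothetical, and it assumes that the same estimated number of homozygous parents is subtracted once from the transmission count of the corresponding allele.

```python
from statistics import median

# Expected median chi-square value under the null, as given in the text.
EXPECTED_MEDIAN_CHISQ = 0.456


def autosomal_odds_ratio(trans_a, trans_b, freq_a, n_parents):
    """Ratio of estimated transmissions from heterozygous parents.

    trans_a, trans_b: total transmissions of alleles A and B
    freq_a: data set frequency of allele A
    n_parents: number of parents contributing transmissions
    """
    freq_b = 1.0 - freq_a
    # Estimated homozygous parents: allele frequency squared times parents.
    het_a = trans_a - (freq_a ** 2) * n_parents
    het_b = trans_b - (freq_b ** 2) * n_parents
    if het_b <= 0:
        raise ValueError("no estimated heterozygous transmissions of allele B")
    return het_a / het_b


def genomic_control_lambda(chisq_values):
    """Median observed chi-square divided by the expected null median."""
    return median(chisq_values) / EXPECTED_MEDIAN_CHISQ
```

For instance, with an allele frequency of 0.5, 100 parents, and 75 versus 65 total transmissions, 25 transmissions of each allele are attributed to homozygous parents, leaving a ratio of 50 to 40.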
Two previous studies noted that TRANSMIT can sometimes inflate type I error (25, 26). One of the studies (26) is difficult to generalize because it used TRANSMIT’s bootstrapping routine to compute p values, which can produce discrete distributions in small samples (37 pedigrees in that study). For the robust variance estimator used here to compute p values, Martin et al. (25) previously clarified that the problem was seen in larger samples when only two affected siblings could be genotyped, in the presence of linkage, and for recessive inheritance with much larger effect sizes than are observed in any GWAS of schizophrenia. We excluded sibling-pair-only families. Also, we initially evaluated TRANSMIT’s type I error rate in 5,000 replicates of our European-ancestry pedigrees for each of a range of minor allele frequencies and linkage models (up to a value of 2 for the relative risk to siblings versus population risk, much stronger than is realistic for schizophrenia) and observed no inflation of type I error rate at nominal significance levels of 0.05–0.001. Finally, our quantile-quantile plots (see Figure S2 in the online data supplement) demonstrate that no substantial inflation occurred.
To estimate power, genotypes were simulated for European-ancestry families under a range of genetic models, and each replicate was analyzed with TRANSMIT. The sample was well powered (>80%) to detect genome-wide significant association for additive allelic relative risks of approximately 1.5 (25%–50% allele frequencies), but not in the range of 1.1–1.2 (1%–2% power to detect genome-wide significant effects).
We performed ALIGATOR (27) analyses of whether gene pathways contained SNPs with low p values more often than would be expected by chance given the observed distribution of SNP p values, for the GO, KEGG, MGI, PANTHER, BioCarta, and Reactome databases plus two locally curated pathways (see p. 12 in the online data supplement).
We used polygenic score tests (1) to evaluate the hypothesis of multiple common risk SNPs, using 112,869 post-QC autosomal SNPs with limited pairwise LD (r²<0.25) that were also available for the PGC phase 1 European-ancestry data set of 9,394 cases and 12,462 controls (using data that were either genotyped or imputed [28] based on HapMap 3 reference haplotypes with information content >0.9). A reference allele for each SNP was assigned a weight equal to the log-odds ratio for association in the family study. For each PGC subject, the observed reference alleles were weighted and summed. The significance of the PGC case-control score difference was analyzed by logistic regression (using the R package), corrected for seven ancestry-based principal component scores as covariates. The proportion of variance explained (R²) by the polygenic scores was computed by subtracting the Nagelkerke R² attributable to ancestry covariates alone from the R² for polygenic scores plus covariates. The analysis was repeated 10 times, starting with only the SNPs with the best 0.01% of p values in the family data, and finally including all SNPs (see Figure 2 legend for details).
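The scoring and variance-explained steps can be illustrated with a small sketch. It is not the analysis code (the regression itself was run in R): the function names are hypothetical, and the log-likelihoods passed to `nagelkerke_r2` are assumed to come from logistic regression models fitted elsewhere.

```python
import math


def polygenic_score(ref_allele_counts, odds_ratios):
    """Sum of reference-allele counts weighted by the family-study log-odds ratio.

    ref_allele_counts: per-SNP count (0, 1, or 2) of the reference allele
    odds_ratios: per-SNP family-study odds ratio for the same reference allele
    """
    if len(ref_allele_counts) != len(odds_ratios):
        raise ValueError("counts and odds ratios must cover the same SNPs")
    return sum(c * math.log(o) for c, o in zip(ref_allele_counts, odds_ratios))


def nagelkerke_r2(loglik_null, loglik_model, n):
    """Nagelkerke R-squared from null and fitted model log-likelihoods."""
    cox_snell = 1.0 - math.exp(2.0 * (loglik_null - loglik_model) / n)
    max_cox_snell = 1.0 - math.exp(2.0 * loglik_null / n)
    return cox_snell / max_cox_snell


def variance_explained_by_score(ll_null, ll_covariates, ll_score_plus_cov, n):
    """R-squared for score plus covariates minus R-squared for covariates alone."""
    return (nagelkerke_r2(ll_null, ll_score_plus_cov, n)
            - nagelkerke_r2(ll_null, ll_covariates, n))
```

Taking the difference of the two R² values isolates the contribution of the polygenic score from that of the ancestry covariates.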
Finally, the 58 independent (r²<0.2) SNPs with the best p values in the phase 1 PGC GWAS (which did not include the present families) were selected for analysis of consistency of direction of effect in the family study (6). These were drawn from the 81 SNPs with p<2×10⁻⁵, including only the best SNP from the extended MHC region that contained most of the significant SNPs but is characterized by extensive LD. For SNPs not genotyped here, we selected a nearby proxy (highest r² with the PGC SNP). After inverting the family study odds ratios when necessary because of differences in chromosomal strand and/or test allele, we determined the number of SNPs with the same direction (both odds ratios <1 or both >1) in the two analyses and computed a binomial test of the probability of observing at least that many consistencies, given the chance expectation of 50% consistency of direction of effect.
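The direction-of-effect test amounts to a one-sided binomial (sign) test with a success probability of 0.5. A minimal sketch follows, assuming the odds ratios have already been aligned to the same strand and test allele; the function names are hypothetical.

```python
from math import comb


def same_direction(or_pgc, or_family):
    """True if both odds ratios are below 1 or both are above 1."""
    return (or_pgc < 1 and or_family < 1) or (or_pgc > 1 and or_family > 1)


def sign_test_p(n_consistent, n_total):
    """One-sided probability of at least n_consistent successes out of n_total
    when each SNP has a 50% chance of a consistent direction."""
    tail = sum(comb(n_total, k) for k in range(n_consistent, n_total + 1))
    return tail / 2 ** n_total


def consistency_p(pgc_ors, family_ors):
    """Count consistent SNPs across paired odds ratios and return (count, p)."""
    n_consistent = sum(same_direction(a, b) for a, b in zip(pgc_ors, family_ors))
    return n_consistent, sign_test_p(n_consistent, len(pgc_ors))
```

Because the tail is summed with exact integer arithmetic before a single division, the result avoids accumulated floating-point error even for several dozen SNPs.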
Data are presented here for segregation of previously identified schizophrenia-associated CNVs within families (chromosomes 1q21.1, 15q13.3, 16p11.2, and 22q11.2 and NRXN1) (7, 29–31). An exploratory case-control analysis to identify new candidate CNVs was also carried out (for the methods and results, see p. 18 of the online data supplement). CNVs spanning three or more probes were called with the PennCNV software program (32). Subjects were excluded if they had ≥50 CNV calls or if the standard deviation of the log(R) ratio (a normalized expression of relative probe intensity for a given subject, which is related to copy number) was >0.4 (indicating increased signal variability across all probes). CNVs were merged if two or more adjacent deletions or duplications had different estimated copy numbers (0 and 1 for deletions, 3 and 4 for duplications) or if a segment with an estimated copy number of 2 contained <30% of the probes in a CNV formed by merging it with two surrounding deletions or duplications (and these merger rules were also applied to chains of such events). For subjects with one of the schizophrenia-associated CNVs and for all of their family members, CNV data for that region were visualized by plotting log(R) ratio and B-allele frequency (the proportion of intensity detected for a designated test allele) and by computing and visualizing point-by-point estimates of copy number using a second algorithm (33). In all cases, the PennCNV call for these large CNVs was confirmed by these additional steps. For the five selected CNV regions, we then examined evidence for transmission within families and for segregation with schizophrenia.
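The subject-level exclusion and the merging rules can be sketched as follows. This is an illustrative reading of the rules, not PennCNV functionality or the code used in the study: `CnvCall`, `keep_subject`, and `merge_calls` are hypothetical names, calls are represented by probe indices, and copy numbers 0/1 and 3/4 are collapsed into the labels "del" and "dup" so that adjacent calls of the same type merge regardless of estimated copy number.

```python
from dataclasses import dataclass


@dataclass
class CnvCall:
    start: int  # index of first probe in the call (inclusive)
    end: int    # index of last probe in the call (inclusive)
    kind: str   # "del" (copy number 0 or 1) or "dup" (copy number 3 or 4)


def keep_subject(n_calls, lrr_sd):
    """Exclude subjects with >=50 CNV calls or log(R) ratio SD >0.4."""
    return n_calls < 50 and lrr_sd <= 0.4


def merge_calls(calls, max_gap_fraction=0.30):
    """Merge same-type calls separated by a short copy-number-2 segment.

    Two calls are merged when the probes between them make up less than
    max_gap_fraction of the probes in the merged CNV. Working left to right
    extends the rule to chains of such events.
    """
    merged = []
    for call in sorted(calls, key=lambda c: c.start):
        if merged and merged[-1].kind == call.kind:
            prev = merged[-1]
            gap_probes = call.start - prev.end - 1
            merged_probes = max(prev.end, call.end) - prev.start + 1
            if gap_probes / merged_probes < max_gap_fraction:
                merged[-1] = CnvCall(prev.start, max(prev.end, call.end), call.kind)
                continue
        merged.append(call)
    return merged
```

Directly adjacent calls have a gap of zero probes and are therefore always merged, which covers the first rule (adjacent deletions or duplications with different estimated copy numbers).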
Association of Common SNPs
For European-ancestry families (Figure 1), lambda (the median chi-square divided by the expected median in null data, 0.456) was 1.025 (see Figure S2 in the online data supplement), indicating minimal technical or ancestry-related artifact. Table 2 lists results for genes with at least one SNP with p<0.0001 within the gene or within 50 kb of it. (See Table S1 in the online data supplement for details of nongenic regions meeting this criterion.) The all-family analysis produced similar results (see Figures S2 and S3 and Table S2 in the online data supplement). No SNP achieved genome-wide significance (p<5×10⁻⁸) in either analysis.
In polygenic score analyses (Figure 2), family-based results significantly predicted PGC case-control status for all thresholds, with the lowest p value of 1×10⁻¹⁷ (explaining 0.4% of the variance) achieved for 34,937 SNPs with p<0.2 in the family study.
PGC and family study odds ratios were in the same direction for 37 of the 58 tested SNPs (one-sided binomial p=0.024) (see Table S4 in the online data supplement), or 29/45 after excluding proxy SNPs with r²<0.8 (p=0.036).
ALIGATOR analyses (see Tables S5 and S6 in the online data supplement) did not detect significant pathway effects (single pathways or excess of number of pathways) after correction for multiple testing.
Previously Documented CNV Regions
Figure 3 illustrates eight pedigrees with CNVs with previous significant evidence for association with schizophrenia (7). We observed 1q21.1 and 15q13.3 duplications segregating with schizophrenia in offspring, but only the reciprocal deletions have been strongly associated in these regions, with weaker evidence for 1q21.1 duplications (7). One of two affected offspring had an exonic NRXN1 deletion, but not the unaffected father (the mother was unavailable). For 16p11.2, duplications were observed in an unaffected mother and two of three affected children. The recruiting site reported a duplication in an unaffected sibling (not genotyped here) (34). It is unlikely that the affected father, who was deceased, carried the same rare CNV. Four cases had 22q11.2 deletions (three typical 3 Mb and one proximal 1.5 Mb), all de novo. Excluding the 15q duplication, these CNVs were seen in seven of 633 families (1.1%), compared with 1.3% of cases in a recent meta-analysis (7). No large 3q29 deletions or exonic VIPR2 duplications were observed (7).
Our results suggest that there is substantial overlap between the common SNPs that confer schizophrenia risk in multiply affected families and in unrelated cases, based on the highly significant polygenic score analysis: when association test results from the family study were used to weight the genotypes of PGC subjects, the resulting polygenic scores significantly differentiated case subjects from control subjects. Note that this result does not prove that there are no genetic effects that are individually stronger or more prevalent in multiply affected families.
It has been proposed that this cross-study consistency is due to a large number (perhaps many hundreds) of risk SNPs in the genome (1, 35). In very large samples, the best results will contain some true associations; for example, in the PGC two-stage analysis of single SNPs, seven chromosomal regions ultimately produced highly significant results, drawn from 58 independent SNPs in the best 53 regions of association in stage 1 (6) (most of them with consistent directions of effect in the family sample). Here, with a small predicting sample, the polygenic score analysis became significant as the proportion of best SNPs included in the analysis increased from 0.1% to 1%, but it was most significant using the best 20%, and in the PGC analysis (with a much larger predicting sample), significance continued to improve when all independent SNPs were included. This suggests that risk SNPs are distributed across the range of p values (or odds ratios), because most of them gave quite small individual effects. Polygenic score analysis cannot currently determine which SNPs are truly involved in risk. Here, network-based analyses did not further define the polygenic effect, and it is likely that an increased understanding of gene and protein functions and interactions will be needed to accomplish this.
The actual proportion of variance in PGC case-control status that could be explained was quite low (0.4%). The variance that can be explained by this type of cross-data set analysis is limited by the need to use only independent SNPs in the analysis, by the fact that GWAS assays do not provide information about all common SNPs, and by loss of information as a result of differences in genotyping methods and ancestral backgrounds of samples. Other forms of analysis suggest that common SNPs actually explain around 20%–30% of the genetic variance for schizophrenia (1, 36). Polygenic score analyses of case-control samples have predicted larger amounts of variance as the predicting sample size has increased, from around 4% with prediction and test samples with approximately 3,000 cases (1) to approximately 7% with a larger predicting sample (around 6,500 cases) and a test sample of approximately 3,000 cases. Here, we used the smaller family sample for prediction to the larger PGC case-control sample, because there is no current method for computing polygenic scores for individual subjects based on family data with some parental genotypes inferred rather than directly observed. Therefore, while our results demonstrate a highly significant overlap in common risk SNPs in these families and the PGC case sample, we cannot determine whether there is any reduction in overlap in multiplex families compared with unrelated cases.
It has been suggested that this polygenic signal could be due in part to weak correlations between common SNPs and nearby rare SNPs or structural variants with larger effects on risk (37). Most evidence does not favor this hypothesis (35); for example, we have not found single families with significant linkage signals that might be produced by rare, heritable large-effect variants. The next generation of sequencing-based studies might shed more light on the genetic effects of various types of sequence and structural variants across the full range of frequencies.
We did not observe larger effect sizes of single SNPs in these multiply affected families than have been reported in case-control samples (www.genome.gov/gwastudies, accessed May 7, 2011). Because exonic deletions in NRXN1 are the only single-gene mutations shown to be associated with large increases in schizophrenia risk (approximately eightfold) (7), we were interested to note that several SNPs with low p values were in or near genes with related functions involving brain development and neuronal cell adhesion and signaling (CNTNAP5, CADM2, ERBB4, PPFIA2, PTPRN2, CLEC4D/E, AMIGO3, and CNTN5 for all ancestries). However, we did not detect statistically significant evidence for association of any defined pathway after correcting for multiple testing of pathways. This could be due to lack of statistical power from the relatively small sample size or because the pathophysiological mechanisms underlying schizophrenia risk are not adequately captured by current pathway definitions.
Five rare CNVs are strongly associated with schizophrenia, and three of them (16p11.2 duplications, 22q11.2 deletions, and NRXN1 exonic deletions) were observed here, along with duplications that are reciprocal to associated deletions of 1q21.1 and 15q13.3; there is some evidence for association of 1q21.1 duplications, but not for 15q13.3 duplications (7). The total frequency of these CNVs (excluding the 15q13.3 duplication) was similar to that observed in previously reported case samples. The family data provide several insights. First, the possibility of a de novo (nontransmitted) 22q11.2 deletion should not be ignored in multiply affected families—indeed, the prevalence of these deletions was similar to that reported in large samples with primarily nonfamilial cases (7). There must have been other genetic or nongenetic risk factors in these families, but it is not known whether their effects were limited to the siblings without a 22q11.2 deletion or whether they also influenced the emergence of the schizophrenia phenotype in the carrier, given that schizophrenia develops in only ∼30% of 22q11.2 carriers. Second, two transmitted CNVs (16p11.2 duplications and a NRXN1 deletion) failed to segregate perfectly with schizophrenia within the family, suggesting again that other risk factors were present.
This GWAS of multiply affected families produced significant support for a polygenic model that posits that multiple common SNPs confer part of the genetic risk of schizophrenia, with a significant overlap between common risk SNPs in multiply affected families and samples of unrelated case subjects. Significant association was not detected for any single SNP, which is consistent with the relatively small sample size, but for the most significant SNPs in the large PGC GWAS analysis, the direction of effect was the same in both samples for a significant excess of SNPs. Several of the “top SNPs” in the family study were in genes related to neurodevelopment, but no statistically significant evidence was observed for association of currently defined gene pathways. Rare CNVs were observed in regions with strong previously documented association with schizophrenia, but with variable patterns of segregation. This should serve as a reminder that we still know relatively little about the distribution of these CNVs in the entire population (e.g., in individuals with no or only mild cognitive problems) or about the reasons for the emergence of schizophrenia in only a minority of carriers, so great caution is required in genetic counseling and prediagnosis.
The authors are grateful to the many family members who participated in the studies that recruited these samples, to the many clinicians who assisted in their recruitment, and to the Schizophrenia Psychiatric GWAS Consortium for use of GWAS data for the polygenic score analysis.