Genome-Wide Organization of GATA1 and TAL1 Determined at High Resolution

Erythroid development and differentiation from multiprogenitor cells into red blood cells requires precise transcriptional regulation. Key erythroid transcription factors, GATA1 and TAL1, cooperate, along with other proteins, to regulate many aspects of this process. How GATA1 and TAL1 are juxtaposed along the DNA and their cognate DNA binding site across the mouse genome remains unclear. We applied high-resolution ChIP-exo (chromatin immunoprecipitation followed by 5′-to-3′ exonuclease treatment and then massively parallel DNA sequencing) to GATA1 and TAL1 to study their positional organization across the mouse genome during GATA1-dependent maturation. Two complementary methods, MultiGPS and peak pairing, were used to determine high-confidence binding locations by ChIP-exo. We identified ∼10,000 GATA1 and ∼15,000 TAL1 locations, which were essentially confirmed by ChIP-seq (chromatin immunoprecipitation followed by massively parallel DNA sequencing). Of these, ∼4,000 locations were bound by both GATA1 and TAL1. About three-quarters of them were tightly linked to a partial E-box located 7 or 8 bp upstream of a WGATAA motif. Both TAL1 and GATA1 generated distinct characteristic ChIP-exo peaks around WGATAA motifs that reflect their positional arrangement within a complex. We show that TAL1 and GATA1 form a precisely organized complex at a compound motif consisting of a TG 7 or 8 bp upstream of a WGATAA motif across thousands of genomic locations.

H ematopoietic stem cells undergo self-renewal and differentiation into many blood cell lineages: erythroid cells (red blood cells), lymphocytes, and myelocytes (including megakaryocytes) (1). Differentiation into red blood cells, referred to as erythropoiesis, requires several transcription factors, such as GATA1, TAL1 (also referred to as SCL), LMO2, LDB1, FOG1, and KLF1 (2)(3)(4). GATA factors are essential for hematopoiesis, as shown by the anemic phenotypes of mouse knockout mutations and the leukemias and lymphomas associated with mutations of the human genes (5). GATA1 is a master regulator of differentiation, proliferation, and apoptosis of red blood cells and megakaryocytic cells (6). Mutations in or misregulation of GATA1 leads to transient myeloproliferative disorder and acute megakaryoblastic leukemia in infants with Down syndrome (6). TAL1, a basic helix-loophelix hematopoietic transcription factor, is required for multiple functions in hematopoiesis, including terminal differentiation of red blood cells (7). TAL1 makes an obligatory heterodimer with E2A basic helix-loop-helix proteins, such as E47, and along with other proteins listed above, nucleates a complex that includes GATA1 bound to its cognate site. An important missing component in our understanding of how GATA1, TAL1, and their cognate DNA recognition motif function is their precise local organization across the genome.
Several ChIP-seq (chromatin immunoprecipitation followed by massively parallel DNA sequencing) studies have been performed on GATA1 (8)(9)(10)(11) and TAL1 (7,11,12) in erythroid cell lines and primary cells as a major step toward understanding the genome-wide binding properties of GATA1 and TAL1 and, ultimately, the biology of GATA1 and TAL1 regulation. ChIP-seq robustly identifies the genomic locations of these proteins and therefore the associated biology. However, the limited resolution of the assay places confidence limits on precise binding locations. Indeed, the broad binding regions of ChIP-seq, which span over 100 bp, make it challenging to decipher the relevant motif when numerous motifs may be present within an occupied segment; they also may not allow any differences in binding patterns and distances between GATA1 and TAL1 binding locations at cooccupied locations to be distinguished.
Estimates of the number of binding locations vary considerably among ChIP-seq studies. Three studies identified 4,000 to 6,000 in vivo binding sites for GATA1 in mouse MEL erythroleukemia cells expressing a tagged form of GATA1 (10,11) or human K562 erythroleukemia cells (9). Another study identified Ͼ15,000 sites occupied in mouse G1E-ER4 cells, where the GATA1 genes had been knocked out of mouse embryonic stem cells (G1E) and then restored under the artificial control of estradiol (8). Recently, ENCODE has reported ϳ12,000 GATA1 and ϳ2,000 to 8,000 TAL1 binding locations in the mouse G1E-ER4 cell system after 24 h of GATA1 activation and 24,000 to 60,000 sites bound by GATA1 in MEL cells (13)(14)(15). While some differences in the num-ber of bound locations may result from occupancy level thresholding, other factors, such as cell type, antibody quality, peakcalling methods, and data quality, might also contribute to the differences. Nonetheless, the substantial number of high-confidence locations has provided substantial insight into the network of genes involved in erythroid development.
GATA1 recognizes the WGATAA (International Union of Pure and Applied Chemistry [IUPAC] consensus) motif, whereas TAL1 and its heterodimeric partner, E2A, recognize the E-box (CANNTG) in solution (16,17). Complexes containing both GATA1 and TAL1 tend to have their cognate motifs spaced 9 bp apart (18,19). However, the determinants of binding may be more complex than those captured by their individual consensus motifs. First, the highly conserved WGATAA consensus sites are insufficient to accurately predict in vivo GATA1 binding (20,21). Second, TAL1/E2A complexes have been suggested to bind to DNA with GATA1 without the need for an E-box (22), although another study found CTG upstream of WGATAA (CTG[N [7][8] ]W GATA), in peaks cooccupied by GATA1 and TAL1 (11). Third, the in vivo developmental functions of TAL1 do not require its DNA binding region (23). Fourth, DNA site selection studies show that, in complex with other proteins, including GATA1, the TAL1/E2A complex prefers to bind GATA1/E-box composite motifs, rather than an E-box alone (18). Moreover, the GATA binding site motif is a stronger determinant of TAL1 occupancy than is the E-box (7,18,19,24). How these erythroid transcription factors are positionally organized around their cognate motif remains unclear and is the focus of this study.
GATA1 activates or represses transcription, depending on the context with other transcription factors. GATA1-TAL1/E2A complexes induce gene expression, while GATA1 without TAL1 represses it (8,11,12,25). However, GATA1 and KLF1 cooccupancy leads to gene activation and may be TAL1 independent. This was indicated by the low overlap between GATA1/KLF1 and GATA1/TAL1 regions ascertained in that study (26). How the interplay and cooccupancy of various proteins in GATA1 multiprotein complexes activate or repress transcription, and therefore regulate the erythroid differentiation program, is not well understood (19). As one step toward this goal, we focused on defining more precisely the structural interactions between GATA1, TAL1, and its DNA binding site on a genomic scale using the ChIP-exo (chromatin immunoprecipitation followed by 5=-to-3= exonuclease treatment and then massively parallel DNA sequencing) assay (27). Since ChIP-exo is a high-precision derivative of ChIP-seq, we expect substantial overlap between locations identified by both methods. As such, the resulting gene-regulatory networks are expected to be the same. Therefore, this study is focused on the structural aspects and parameters of binding discernible by ChIPexo but not by ChIP-seq rather than on gene-regulatory pathways.

MATERIALS AND METHODS
Cell culture. Cells were cultured as described previously (28). G1E and G1E-ER4 cells were grown in Iscove's modified Dulbecco's medium (IMDM) with 15% fetal calf serum, 2 U/ml erythropoietin, and 50 ng/ml kit ligand. To activate the conditional GATA1-ER, cells were cultured in the presence of 10 Ϫ7 mol/liter beta-estradiol for 24 h (29), and G1E-ER4 cells from 0 h, 3 h, and 24 h were obtained during this activation process. Then, sonicated chromatin materials of G1E and G1E-ER4 0-h, 3-h, and 24-h cells were prepared by standard methods.
ChIP-exo. With prepared sonicated chromatin, ChIP was performed on GATA1 (antibody sc265 L1609) and TAL1 (sc12984). Standard ChIP methods were used, followed by lambda exonuclease treatment and li-brary construction, as described previously (30). Sequencing was performed using an AB 5500xl genetic analyzer, Illumina HiSeq2000, and Illumina NextSeq. For single-end reads from HiSeq, base calls were performed using CASAVA version 1.7 (Illumina), and for paired-end reads from NextSeq, base calls were performed using Bcl2fq version 2.15 (Illumina). ChIP-exo reads were aligned to the mm9 genome assembly using Bowtie 1.00 for SOLiD and BWA (version 0.6.2 for HiSeq single-end reads and version 0.7.9a for NextSeq paired-end reads) with default options. Non-uniquely mapped reads were filtered out in order to remove the reads with low mapping quality. The sequencing statistics are reported in Table S1 in the supplemental material.
Determination of binding locations using intersection of peak pairing and MultiGPS (see Fig. 2). Prior to peak pairing, tags from biological replicates were merged after demonstrating their reproducibility. During peak pairing, sequence read distributions were used to identify peaks using the strand-separate peak-calling algorithm in GeneTrack (parameters: sigma ϭ 5, exclusion zone ϭ 10) (31). After peaks within the blacklist regions were removed (32,33), the peaks were paired if the plus-strand peak was within 5 bp upstream or 25 bp downstream of the minus-strand peak. Peak pairs that were enriched Ͼ2-fold over an input control with a q value of Ͻ0.05 were selected (q values are adjusted P values from binomial tests for multiple-hypothesis testing). To collect binding locations that are present at one or more time points ("union" of binding locations), binding locations within 40 bp were determined to be bound at multiple time points, while those more than 40 bp apart were determined to be time-point-specific binding.
MultiGPS is designed to detect binding locations across multiple conditions while characterizing differential binding between conditions (34). MultiGPS detected binding locations across multiple conditions with reads enriched Ͼ2-fold over the input control and q values of Ͻ0.05. The MultiGPS commands were as follows: o -geninfo mm9.info -threads 4 -q 0.05 -d reb1_chipexo.distrib.txt -exclude blacklist.bed -design design_ gata1 -verbose -probshared 0.99 -poissongausspb -medianscale -prlogconf -5 -memepath usr/bin -mememinw 6 -mememaxw 16 -seq mm9. "-q" is for the minimum Q value (corrected P value) of reported binding events, and "-d" is for the binding event read distribution file for initializing models; the true distribution of reads around binding events is estimated during MultiGPS training (52). GATA1 and TAL1 binding locations are reported in Table S2 in the supplemental material.
Homotypic clustering of transcription factor binding analysis (see Fig. 3). Occurrences of the distance between adjacent binding location midpoints were calculated. Binding location midpoints that were closer than 500 bp from the adjacent binding locations were clustered as one binding unit, a homotypic cluster of transcription factor binding, while the median genomic coordinate was taken as the binding location. When a binding location was more than 500 bp from a nearby binding location, it was considered a noncluster. The closest transcription start site (TSS) of the mm9 RefSeq gene was assigned to the binding location as the target gene. Gene expression fold changes were obtained from transcriptome sequencing (RNA-seq) of G1E and G1E-ER4 cells across time points (14,15).
Binding around WGATAA motifs (see Fig. 4 and 5). MEME (multiple EM for motif elicitation) analysis, which finds ungapped motifs in unaligned DNA, RNA, or protein sequences, was performed on 80-bp sequences surrounding the top 500 highly occupied binding locations of GATA1 and TAL1. Occurrences of top motifs were scanned across the binding locations using FIMO (find individual motif occurrences) (P Ͻ 10 Ϫ3 ) to classify the binding locations by the presence of cognate motifs. FIMO analysis searches a sequence database for occurrences of known motifs. This program treats each motif independently and reports all putative motif occurrences below a specified P value threshold. The binding locations were centered on the most significant motifs if a motif was present within a 40-bp distance. When a motif was present multiple times, the most significant motif (with a more significant P value) was chosen. Figures displaying strand-specific sequencing tags of merged time points represent the merged raw data (G1E-ER4 cells at 0 h, 3 h, and 24 h for GATA1, and G1E and G1E-ER4 cells at 0 h, 3 h, and 24 h for TAL1) without normalization.
When sorting the binding locations by occupancy around the WG ATAA motif, GATA1 occupancy was measured as tags within 25 bp upstream to 25 bp downstream, while TAL1 occupancy was calculated as tags within 40 bp upstream to 30 bp downstream.
When 7,927 GATA1 binding sites with a WGATAA motif were sorted by TAL1 occupancy, approximately one-third (2,997 locations) showed cooccupancy of GATA1 and TAL1, while the remainder (4,930 locations) showed only GATA1 occupancy. Then, the frequencies of dinucleotide TG distances from WGATAA were calculated for both GATA1-and TAL1-cooccupied sites and GATA1-only-bound sites. Among the top 2,997 GATA1-and TAL1cooccupied WGATAA sites, TG was present at 1,650 sites between 7 and 9 bp distant from WGATAA. To examine the sequence composition of a full Ebox (NNNNTG), the 4 upstream sequences of 1,650 TG sites were examined. TAL1-only binding locations around the E-box (see Fig. 4A). Cooccupancy of GATA1 and TAL1 binding locations were determined if the binding location midpoints were within 40 bp. When TAL1-only binding locations with a WGATAA motif (FIMO P value Ͻ 10 Ϫ3 ) were removed, the remaining sequences were mostly enriched with the E-box CAGMTG motif. Then, these binding locations were further classified by the presence or absence of an E-box motif within 40 bp of the binding location midpoint.
ChIP-exo and ChIP-seq comparison (see Fig. 9). Overlap between the binding locations of ChIP-exo and ChIP-seq (15) for the same factor in identical cell types was examined. Binding location midpoints lying within a 40-bp distance were defined as overlap. As a measure of consistency, Spearman correlation of log 10 -transformed read counts of ChIP-exo and ChIP-seq within a 400-bp window surrounding binding location midpoints of all ChIP-seq (n ϭ 22,729) and ChIP-exo (n ϭ 10,168) reads were calculated. A heat map scatter plot was drawn using the LSD package of R. Lastly, the distributions of GATA1 G1E-ER4 3-h cell ChIP-exo reads (total tag count normalized) around ChIP-seq-only-bound locations (n ϭ 14,071) were plotted. Fig. 10). EdgeR (Bioconductor [https://bioconductor.org/packages/release/bioc/ html/edgeR.html]) was run internally, along with MultiGPS, to call differential binding events between time points. k-means clustering was performed on log 2 -fold change of GATA1 occupancy between G1E-ER4 cells at 3 h/0 h, 24 h/3 h, and 24 h/0 h. For analyses of occupancy levels, data were normalized so that the total tag counts were equal across all time points of a given factor and mapped around the binding location midpoints. Binding locations were assigned to the closest annotated RefSeq gene to calculate their gene expression fold change between time points (15) and their distances from the transcription start site.

Differential occupancy of GATA1 across time points (see
Microarray data accession number. Sequencing data are available at the NCBI Sequence Read Archive under accession number GSE68964.

RESULTS
In this study, we employed G1E (GATA1 Ϫ ) mouse erythroid cells (35,36). Reintroduction of GATA1 fused to the estrogen binding domain provides a cell system (G1E-ER4) for synchronous and homogeneous erythroid maturation in response to estradiol treatment (28). Here, we examined these cells before and after 3 and 24 h of estradiol treatment.
GATA1 and TAL1 locations determined by ChIP-exo. TAL1 is abundantly expressed in G1E cells, and its production declines very little after activation of the G1E-ER4 cell line (data not shown). Compared to ChIP-seq, the use of exonucleases in ChIPexo substantially reduces background, which often represents a substantial proportion of the ChIP-seq signal (27). Therefore, in addition to its higher precision, a practical advantage of ChIP-exo is that fewer total tags are required to achieve a similar depth of identification of locations (reported in Table S1 in the supplemental material) (27). In ChIP-exo, the 5= ends of the sequencing tags correspond to a point about 6 bp upstream (5=) of a protein-DNA  cross-link where movement of the lambda single-stranded exonuclease is blocked. For a single protein-DNA cross-link, a block occurs on both strands, resulting in two peaks of 5= ends located on opposite strands and separated by about 12 bp in the 3= direction. Pairing of these peaks provides a potential resolution of a few base pairs (27). In ChIP-seq, the 5= ends of sequenced tags correspond to random breakpoints in sonicated solubilized chromatin and have less resolution than in ChIP-exo. Unlike ChIP-exo, ChIP-seq cannot resolve individual points of cross-linking within an individual binding location. Resolution is essential where proteins bind in closely clustered locations. For example, at many genomic locations, ChIP-seq detects GATA1 as a broad peak, whereas ChIP-exo reports many individual peaks, some of which correspond to separate WGATAA motifs (Fig. 1). Even when the signal track for ChIP-seq suggests multiple binding sites, peak-calling algorithms may combine them into a single peak.
As is common with ChIP assays, the signal intensities at many locations may be relatively low, resulting in substantial numbers of false positives. False positives may be reduced by placing tighter positional constraints on relative ChIP-exo peak locations, but this potentially produces more false negatives. To identify potential alternative modes of binding, we allowed some positional flexibility. To balance these opposing stringencies, we implemented and compared two complementary approaches to identify binding locations: peak pairing and MultiGPS (34) ( Fig. 2A). Peak pairing and MultiGPS are commonly used for ChIP-exo and ChIP-seq analysis, respectively, but have not previously been compared.
MultiGPS uses integrated machine learning to call binding  events that are consistent with the learned binding pattern across multiple time points and biological replicates. MultiGPS allows the determination of binding events that are shared across conditions or are condition specific, using the binding pattern profile. Both methods required ChIP-exo signals to be at least 2-fold en-riched over the input control, and a q value of Ͻ0.05 (q values are multiple-hypothesis-testing adjusted P values from binomial tests).
In an effort to examine comprehensive binding locations across all time points, we considered all statistically significant  peak pairs arising from 0, 3, and 24 h of GATA1 activation. Peak pairing yielded ϳ47,000 initial GATA1 candidate cross-linking points (which are not necessarily distinct binding locations) in mouse G1E-ER4 cells, whereas MultiGPS yielded ϳ13,000 initial locations. Often, distinct peak pairs were very close together (Ͻ20 bp), and MultiGPS modeled them as a single location. Consequently, ϳ10,000 (80%) MultiGPS locations contained ϳ20,000 peak pairs (Fig. 2B; see Table S2 in the supplemental material). This intersect was highly enriched with the WGATAA motif (motif P value threshold ϭ 10 Ϫ3 ), thereby providing a general validation of the binding locations. Approximately 80% (7,927/10,290) contained the WGATAA motif. Thus, we detected and further analyzed ϳ8,000 GATA1-bound WGATAA binding sites in differentiating mouse G1E-ER4 cells. These locations were detected by peak pairing and MultiGPS and contained a WGATAA motif and therefore represent a high-confidence set of locations. The remaining ϳ2,000 sites, which were detected by both methods but lacked a WGATAA motif, may include some combination of noncognate DNA interactions, interactions with degenerate WG ATAA motifs that fell below our bioinformatics detection threshold, and false positives.
When comparing the binding locations that were separately detected by MultiGPS and by peak pairing, the outersects, consisting of calls made by only one method, were less enriched with WGATAA motifs (60% and 46% versus 82%). Thus, while highly enriched with true positives, those called by only one method either have a higher false-positive rate or involve noncanonical cross-linking patterns that are not picked up by peak pairing. Representative examples of how calls at WGATAA sites can be made by only one method and not the other are shown in Fig. 2C (for TAL1). Whereas peak pairing could detect simple one-coordinate peaks in each pair, MultiGPS discounted them. In contrast, many binding locations that were detected only by MultiGPS were predominantly enriched with tags on only one strand or had sparsely distributed tags, thereby precluding peak pairing. Moreover, lowoccupancy peak pairs often occurred at noncognate locations in the "shadows" of robust cognate binding events, which MultiGPS rolled into a single location. Also, tags may be piled up on both strands and are detected by peak pairing but not by MultiGPS, since they do not fit the distribution shape that MultiGPS deems to be a consensus. This analysis highlights the advantages and limits of both methods. Conservative estimates of locations might then use only the intersection resulting from the two methods but might have substantial numbers of false negatives, whereas more comprehensive estimates might include the union of the two methods but might have substantially more false positives.
When TAL1 was examined, the intersection of the two location-calling methods resulted in ϳ15,000 candidate locations de-tected in one or more GATA1 activation time points (see Table S2 in the supplemental material). Therefore, MultiGPS peaks that overlapped with peak pairing were taken as higher-confidence binding locations, resulting in 10,290 GATA1 binding locations and 15,209 TAL1 binding locations. These are about 75% and 200%, respectively, of those reported previously by ChIP-seq.
One advantage of ChIP-exo is its ability to detect two closely bound factors that would be called as the same binding location with ChIP-seq data. We therefore examined whether GATA1 ChIP-exo locations occurred in clusters. Among 10,290 GATA1 binding locations, 14% (1,414 GATA1 binding locations) were less than 500 bp from each other, forming 663 GATA1 clusters (Fig. 3A) with approximately two GATA1 locations per homotypic cluster. Surprisingly, a substantial portion (64% [9,796]) of TAL1 binding locations formed 2,905 TAL1 clusters (Ͻ500 bp) (Fig. 3A). On average, approximately three TAL1 locations formed a homotypic cluster. The nearest genes to a cluster showed greater average gene expression change upon GATA1 activation than nonhomotypic clusters for both GATA1 and TAL1 (Fig. 3B), suggesting that clustering, on average, results in higher levels of gene expression.
A structural model for the genome-wide cobinding of TAL1 and GATA1. Though GATA1 and TAL1 have been known to work together, a high-resolution view of their precise positioning within a complex on DNA has not been examined in vivo, particularly on a genomic scale. Further, although cooccupancy by GATA1 and TAL1 is known to activate gene expression, the recognition motif for their cooccupancy still remains poorly identified. To this end, we compared the ChIP-exo binding locations of GATA1 and TAL1 (Fig. 4A) to determine the regions of cobinding and to analyze their underlying DNA sequence. Within a defined occupancy threshold, we identified 3,736 GATA1-bound GATA motif locations that also contained 4,245 TAL1 binding locations within 40 bp. The remaining 6,554 GATA1-bound GATA motif locations contained either no or low (i.e., subthreshold) levels of TAL1. Similarly, the remaining 10,964 TAL1 locations contained either no or subthreshold levels of GATA1. These "TAL1-only" locations were enriched with the same motif (described below) seen for GATA1/TAL1-cobound locations, indicating that a different GATA factor might be bound instead of GATA1. GATA1 is a member of a set of related GATA binding proteins.
In an effort to explore the genomic organization of GATA1 and TAL1 around the WGATAA motif, we plotted GATA1 ChIP-exo tag 5= ends around all 7,927 WGATAA motifs (motif P value Ͻ 10 Ϫ3 ) that were enriched in the 10,290 GATA1 binding locations (Fig. 4B, red-shaded plot). Remarkably, GATA1 displayed a double-peak-pair pattern around the WGATAA motif, with a peak pair located at each end of the motif. This is similar to what we have seen with many other proteins, including CTCF (27) and p53 (37), where cross-linking typically occurs at the edges of protein/ DNA complexes. For GATA1, a major and a minor peak pair were observed, with the two pairs 16 bp apart. These points of crosslinking align precisely with structural models of the DNA binding domain of GATA1 in complex with DNA (shown in Fig. 4B, top), which indicates these models likely reflect the binding structure of GATA1 bound throughout the genome. Cross-linking to the left side of WGATAA was stronger than on the right side, which likely reflects differential reactivity between an appropriate cross-linkable amino acid and the DNA on the two sides. This level of structural congruity between genomic binding events and crystal structures has thus far been described in metazoans only for FoxA1 (38), the glucocorticoid receptor (39,40), and the DMRT transcription factor (41) but to our knowledge represents the first genome-wide structural assessment of multiple components within a heteromeric complex. TAL1 also displayed two major peak pairs, but the pairs were situated to the left of WGATAA sites (Fig. 4B, black trace). Their midpoints were 21 and 13 bp upstream (more 5=) of the WGATAA midpoint. The two inferred points of cross-linking were about 8 bp apart, which agrees with the modeled structure of TAL1/E47 with an E-box (18). A minor level of TAL1 cross-linking was observed on the right side of the WGATAA motif. This cross-linking did not match the double-peak pattern of GATA1 and thus largely rules out TAL1 cross-linking indirectly to DNA via cross-links to GATA1. Instead, we suspect that TAL1, in complex with GATA1 and other proteins, may also be in close proximity to DNA on the distal side of GATA1 (in addition to its main proximal-side interactions).
The positioning of GATA1 and TAL1 cross-linking points remained consistent relative to WGATAA motifs across most locations (Fig. 5A, sorted by GATA1, and B, sorted by TAL1), including whether GATA1 was associated with TAL1. Hence, the genome-wide average was not a skewed representation of a few sites having high tag counts. Low-occupancy locations often did not contain a full complement of GATA1 tags at each of the four consensus peak locations (i.e., two peak pairs). This was evident when sorting by the signal strength of each of the four peaks separately (Fig. 6). Each peak in a peak pair reflected two distinct measurements of the same cross-link and so should have had roughly equivalent tag counts. The variance in tag counts at each of the four peaks within a location may have multiple sources, including statistical sampling, technical variation in library construction, and double cross-links. The last occurs when a single protein-DNA molecule has two cross-links instead of the more common single cross-link. Since GATA1, on average, has two major points of cross-linking, some fraction of bound molecules may have double cross-links. When this occurs, due to the 5=-3= directionality of lambda exonuclease, the more 3= cross-link of a double cross-link will not be observed.
We also addressed whether potential nucleotide bias in crosslinking efficiency might account for the observed range of GATA1 occupancy. This issue is often raised as a hypothetical concern in ChIP studies. The nucleotide sequence in the vicinity of the major GATA1 cross-linking point (the Ϫ8 position from the WGATAA midpoint) deviated little from the overall local average when comparing high versus low GATA1-occupied sites (Fig. 7). The small amount of deviation toward higher GϩC frequency at highly occupied sites was also evident in surrounding regions, indicating that it was not specific to the site of cross-linking. Therefore, we conclude that the detection of different binding site occupancy levels is not substantially influenced by putative base-specific differences in cross-linking efficiency. TG is enriched 7 or 8 bp upstream of WGATAA at GATA1and TAL1-cooccupied locations. We examined the DNA sequence underlying the cooccupancy by GATA1 and TAL1 at all 7,927 identified GATA1-bound WGATAA motifs. Motif locations were first grouped by whether they contained or lacked (i.e., were above or below a set threshold) TAL1 and then sorted by GATA1 occupancy (Fig. 5A). The merged tags of multiple time points (G1E-ER4 cells at 0 h, 3 h, and 24 h for GATA1 and G1E and G1E-ER4 cells at 0 h, 3 h, and 24 h for TAL1) were used to determine occupancy. Where GATA1 and TAL1 cooccupied the same location, their occupancy levels were positively correlated (R ϭ 0.42), indicating that binding events are directly or indirectly linked. A TG dinucleotide motif was enriched upstream (more 5=) of the WGATAA motif (Fig. 5A, top, green/yellow vertical stripe) in the GATA1-and TAL1-cooccupied sites, while no TG enrichment was observed in regions having subthreshold levels of TAL1 (Fig. 5A, bottom). The distance between the two closest ends of the two motifs was 7 or 8 bp (i.e., TGN 7-8 WGATAA).
One interpretation of this TG dinucleotide is that it comprises the fifth and sixth nucleotides of an E-box (CANNTG) that has been previously linked to WGATAA motifs at TAL1-bound sites in vivo (7,11,16) and to gene activation (20). We investigated this possibility by inspecting the remaining sequences of a putative E-box configuration, where the conserved TG represented the most WGATAA-proximal side of the E-box. Indeed, matches to an E-box consensus were the most abundant ( Fig. 8)  Those with an exact match to the E-box CANNTG motif had the central 2 nucleotides as largely KC dinucleotides (K ϭ G/T). Thus, TAL1-GATA1 genomic binding sites are predominantly (MA/BG)N 2 TGN 7-8 WGATAA, but with a bias toward CAKCT GN 7-8 WGATAA. The tight positional linkage of the (MA/BG) N 2 TG motif with WGATAA and the degeneracy of the remaining sequence in relation to a consensus E-box suggest that DNA interactions at these (MA/BG)N 2 TG motifs contribute to TAL1 binding specificity (in addition to direct or indirect interactions with GATA1 bound at WGATAA).
Structural relationship between points of TAL1/GATA1 cross-linking and DNA sequence. We next attempted to orient and position the TAL1/E2A crystal structure (22) to the (MA/ BG)N 2 TGN 7-8 WGATAA motif, as illustrated in Fig. 4B. The two primary TAL1 cross-linking points are located 21 and 13 bp upstream of the A at the WGATAA midpoint. Those cross-links flank positions 1 and 2 of NNNNTGn 7-8 WGATAA (positions 1 and 2 are underlined). Making the reasonable assumption that the TAL1 DNA binding alpha helix is what is cross-linking to DNA (22), the most likely placement of the TAL1 alpha helix is midway between its two points of cross-linking and thus within 1 bp of positions 1 and 2. Note that only a single alpha helix of TAL1 is expected to bind DNA, with the other alpha helix in the structure coming from its E2A partner. If this interpretation is correct, then in the context of GATA1, the occurrence of MA or BG dinucleotides at the site of TAL1/DNA contact (positions 1 and 2) supports the crystallographic model in which TAL1 makes fewer overall contacts with DNA than other bHLH proteins (22). In fact, only a single amino acid (E196) makes base-specific contact, which occurs at N-4 of C (in CANNTG). Where A occurs in place of C at position 1, we speculate that N-6 of A might provide the hydrogen bond contact in place of N-4 of C. How BG dinucleotides are accommodated is less obvious.
Given the placement of TAL1 at positions 1 and 2, we suggest its E2A partner contacts the TG dinucleotide (i.e., its complement, CA) at positions 5 and 6 (22,42) and thus positionally closer to GATA1 along the DNA. This interpretation would place the TAL1/E2A heterodimer in a single predominant orientation (with TAL1 being distal) with respect to GATA1/WGATAA, in addition to being located primarily upstream of WGATAA (i.e., essentially as oriented in the structural model in Fig. 4B).
A second, alternative interpretation of the double peak pairs generated by TAL1 is that one peak pair is from TAL1 and the other is from its partner, in which the heterodimer is directionally oriented relative to the motif (i.e., bound on the 5= versus the 3= side of the motif DNA sequence). We deem this to be a more complex and less likely scenario, since cross-linking detection would require that TAL1 cross-link to its partner and that its partner cross-link to DNA. Due to the inherent inefficiency of crosslinking, this is expected to be a low-frequency situation. A third possibility is that TAL1 and its partner can bind in both the forward and reverse orientations. If the heterodimer were binding in both orientations, then in a population of molecules, the crosslinking points should be equidistant and symmetrical from the E-box midpoint. However, that was not observed. Therefore, we favor a model in which TAL1 and its partner have a directional orientation upstream of GATA1 binding, where TAL1 is the more distal partner, and thus, its binding is not being specified directly by the TG motif.
ChIP-exo versus ChIP-seq in location detection. We compared ChIP-exo to a previously existing set of ChIP-seq locations (15) performed in the same cell system, with the intention of understanding why some locations were called by one method and not by the other. For ChIP-exo, we used the set of locations defined by the intersection of peak pairing and MultiGPS (filtered to be Ͼ2-fold over background; q value Ͻ 0.05). This high stringency gives stronger confidence in the peak calls, but at the cost of some false negatives. The sets of ChIP-exo and ChIP-seq locations showed substantial overlap (Fig. 9A), as expected, since ChIP-exo is essentially a refined version of ChIP-seq. A high percentage (50 to 72%) of ChIP-exo locations that did not overlap contained WG ATAA motifs (motif P value Ͻ 10 Ϫ3 ) Ͻ40 bp away, some of which are likely to be false negatives. This compares with 75 to 84% of ChIP-seq/ChIP-exo intersects having a motif. About half of all ChIP-seq locations that did not overlap ChIP-exo were deemed to be enriched with false negatives by the same criteria. Taken at face value, the nonoverlapping locations (outersects) may be binding events detectable in only one assay. Alternatively, thresholding of the data may result in a set of locations being marginally above a threshold in one assay and marginally below a threshold in the other assay.
We tested this by directly comparing occupancy levels between the two assays. A scatter plot comparison of occupancy levels from ChIP-seq versus ChIP-exo showed that they were well correlated (Spearman R, ϳ0.65) (Fig. 9B) when both assays identified GATA1 locations. The correlation dropped to 0.4 when locations were identified only by ChIP-seq. Locations identified by only one assay typically had low occupancy in both assays. Therefore, the lack of a call in the ChIP-exo assay is more likely due to differences in stringency thresholding between the two assays rather than a qualitative distinction between the ChIP-seq and ChIP-exo assays.
To further address the ChIP-exo/ChIP-seq correspondence of the outersects, we examined the ChIP-exo GATA1 tag distributions around the midpoints of occupied segments bound only by ChIP-seq. On average, the ChIP-exo peak pairs were centered on the ChIP-seq-only midpoint locations (Fig. 9C), demonstrating that they were indeed reporting on similar locations despite being below the detection threshold.
Taken together, many of the locations called in only one of the two assays likely reflect a substantial amount of real but low-occupancy binding events. While such events can be captured, they will also have a higher false-positive rate. The primary difference between called locations in the two assays lies in the level of false discovery (the sum total of false positives and negatives), defined by assay-specific location-calling thresholds. While the use of thresholds adds confidence to location calling by reducing false positives, it comes at the price of creating false negatives.
Dynamics of GATA1 and TAL1 binding during erythroid development. The erythroid developmental program has already been well described based on ChIP-seq. Since the ChIP-exo data we obtained substantially overlapped the ChIP-seq locations, we fully expect that linked genes would also be involved in various erythroid pathways. To confirm this, we examined the changes in   (Fig. 10A). Six kinetic classes were produced by k-means clustering, which were grouped into increased (clusters 1 and 2), unchanged (clusters 3 and 4), and decreased (clusters 5 and 6) GATA1 occupancy. Since GATA1 was ectopically induced, we assume that its total level of genome-wide binding would either increase or stay approximately the same. We normalized the total of all occupancy levels at each time point to be constant across the time points. Therefore, an apparent relative "decrease" in GATA1 occupancy upon GATA1 induction may actually reflect less of an increase than other locations. In general, the kinetic patterns of TAL1 occupancy were similar to those of GATA1, as expected when binding as a complex (Fig. 10A, right). We note that prior to induction (0 h), GATA1 is in both the cytoplasm and the nucleus (42) and thus has a significant level of binding genome-wide. This preactivation state is insufficient to promote differentiation, and it may be inactive until released by estradiol. The closest annotated mouse gene was assigned to each GATA1 binding location to examine the effects of GATA1 and TAL1 binding   on gene expression. These locations were largely associated with genes involved in blood cell development and maintenance, as previously determined (references 35 and 42 and data not shown). The binding locations with the greatest increase in GATA1 occupancy displayed the greatest increase in gene expression of the nearest gene in response to GATA1 activation (15) (Fig. 10B, clusters 1 and 2). GATA1 was closer to their TSS on average than the average of all other genes linked to a GATA1 location that did not experience the same relative increase in GATA1 occupancy (i.e., clusters 3 to 6). Therefore, where GATA1 occupancy differentially increases near a TSS, its gene may be activated during erythroid differentiation.

DISCUSSION
In this study, we determined the genome-wide positional organization of GATA1 and TAL1 at nearly single-base-pair resolution using ChIP-exo. Comparison of location calling using peak pairing versus MultiGPS revealed that they largely call the same locations. Peak pairing picks up low-complexity binding locations that are missed by MultiGPS, whereas MultiGPS picks up locations where tags are missing on one strand (likely due to molecule-specific biases arising during sample preparation and library construction). These are missed by peak pairing. Often the biggest differences in called locations are due to differences in data thresholding (tag counts and patterning). A large fraction of locations typically fall near the threshold, and thus, small differences in thresholding can create an appearance of incongruence between location-calling methods. We similarly compared ChIP-exo to ChIP-seq and found them to be highly similar, with the main differences attributable to differences in false-discovery rates and thresholding.
We identified about 10,000 GATA1-bound WGATAA sites in mouse G1E cells containing ectopically expressed GATA1 and about 15,000 TAL1 locations. About 3,000 of these locations correspond to TAL1/GATA1 complexes bound to a WGATAA motif having a TG dinucleotide enriched 7 or 8 bp upstream. TAL1, and presumably its E2A partner, bind stereospecifically to GATA1-bound DNA. Our assessment of the cross-linking pattern leads us to propose that TAL1 contacts DNA more distally upstream of WGATAA than E2A. We also found evidence of TAL1 being in close proximity to DNA on the downstream side of GATA1, but the structural basis for this is unknown.
The structural model proposed here gives a basis for mechanistic hypotheses about the many functions of TAL1 and GATA1. TAL1 plays essential roles at multiple stages of hematopoiesis, including specification of hematopoietic cell lineages and prevention of ectopic cardiogenesis in the multipotent cardiovascular endoderm (43), establishment of early hematopoietic stem and progenitor cells (44)(45)(46), and differentiation to produce erythroid cells and several myeloid cells (47). These pleiotropic effects are accomplished by dynamic changes in the genome-wide binding profiles of TAL1 during differentiation, leading to TAL1 occupancy at regulatory regions distinctive of each cell type (7,24,33,(48)(49)(50)(51). These large-scale changes in TAL1 occupancy during differentiation appear to be driven, at least in part, by TAL1 binding, together with a GATA factor, which provides a major component of the sequence specificity (7,24,51). Our structural model for sites of cooccupancy shows sequencespecific interaction of GATA1 zinc fingers with the binding site motif, WGATAA, and the E2A heterodimeric partner of TAL1 binding to a partial E-box. However, the interaction of TAL1 with DNA does not show detectable sequence specificity. By assuming a similar structure for TAL1 cobound with other GATA factors (e.g., GATA2 in hema-topoietic stem and progenitor cells and GATA4 in multipotent endoderm), one can envision a mechanism whereby TAL1 binds specifically to a distinct set of regulatory regions in different cell types. The model suggests that TAL1 is not guided primarily by binding its preferred binding site in solution (an E-box) but rather is guided by its interaction with a GATA factor that is strongly bound to its DNA binding site motif. Thus, the cell-type-specific binding by a series of paralogous GATA factors could be a major determinant of the cellspecific binding profiles for GATA1. These sites of cooccupancy by GATA factors and TAL1 are strongly associated with gene induction (12,50,51), a result that is recapitulated by our high-resolution mapping here. Multiple mechanisms of GATA1-dependent repression have been proposed, each applying to a different subset of genes (8,10). During the specification of hematopoietic versus cardiac cell fates in mesodermal cells, TAL1 not only activated genes needed for hematopoiesis, but prevented the retention of active chromatin marks at enhancers needed for the cardiac lineage (51). Our structural model suggests possible explanations for the different effects of TAL1. In the presence of a GATA factor, a structure with TAL1-E2A bound upstream from GATA may serve as a platform for recruitment of activators and coactivators in induction. Once the GATA factor is no longer present, the remaining TAL1 may assume a different position or conformation that interferes with the recruitment of positive regulators, leading to loss of induction.
The high resolution of ChIP-exo allowed us to identify DNA segments bound by multiple molecules of GATA1 or TAL1 in a cluster. These clusters were not resolved by ChIP-seq. The clusters of homotypic binding represent a small but significant subset of all the bound locations. The significance of this is unclear, although it might reflect constraints imposed by higher-order structures that involve multiple copies of GATA1 and TAL1. Our analyses did not identify major functional distinctions between clustered and nonclustered locations, apart from the former being more highly expressed on average. Nonetheless, G1E differentiation along the hematopoietic lineage upon ectopic activation of GATA1 is linked to a substantial increase in GATA1/TAL1 binding to about 3,000 sites genome-wide.